The Big Data Bible (Well Almost!)

Over the last year, I’ve been working with several new customers, implementing strategy and capabilities for Business Intelligence and Big Data.  What occurred to me over that time is that many of my clients kept asking key questions about the meaning of things within this space.  For example, in one situation I implemented a predictive analytics solution, and there were many questions about specific terms such as Hadoop, R, Spark and so on. These terms were bandied about by suppliers, and my clients were often left stranded on the island of confusion.

So, as an ode to my clients, I put together a list of terms they could refer to – a mighty long list too!  I’m not going to say much more, apart from acknowledging a few references – thanks to Wikipedia, Datafloq, Teradata, TechRepublic and some entries I’ve even written myself!

I know I haven’t got everything listed, so please add those that are missing in the comments section.  Hey, we may even end up with “the” list of terms!


ACID test – A test applied to data transactions for the four properties of atomicity, consistency, isolation, and durability

Aggregation – a process of searching, gathering and presenting data
Algorithm – a mathematical formula or statistical procedure that can perform certain analyses on data
Analytics – the discovery of insights in data

Analytics Platform – a full-featured technology solution designed to address the needs of large enterprises. Typically, it joins different tools and analytics systems together with an engine to execute, a database or repository to store and manage the data, data mining processes, and techniques and mechanisms for obtaining and preparing data that is not stored. This solution can be delivered as a software-only application or as cloud-based software as a service (SaaS), provided to organisations in need of the contextual information that all their data points to – in other words, analytical information based on current data records.
Anomaly detection – the search for data items in a dataset that do not match a projected pattern or expected behaviour. Anomalies are also called outliers, exceptions, surprises or contaminants and they often provide critical and actionable information.
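A minimal sketch of one common approach, assuming numeric data and a simple z-score rule (the threshold and example data are invented; 3.0 is a common cutoff for large samples, but small samples need a lower one):

```python
# Flag values whose z-score (distance from the mean, in standard
# deviations) exceeds a threshold. Illustrative only; the threshold
# and data below are made up for the example.
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]  # 95 is the planted anomaly
print(find_anomalies(data))  # [95]
```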
Anonymization – making data anonymous; removing all data points that could lead to identify a person
Application – computer software that enables a computer to perform a certain task
Artificial Intelligence – developing intelligent machines and software that are capable of perceiving the environment, taking corresponding action when required and even learning from those actions.

Automatic identification and capture (AIDC) – Any method of automatically identifying and collecting data on items, and then storing the data in a computer system. For example, a scanner might collect data about a product being shipped via an RFID chip.

Avro – a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
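Avro schemas are plain JSON documents; a hypothetical record schema (the record and field names here are invented for illustration) looks like this:

```python
# A hypothetical Avro schema, expressed as the JSON document Avro
# expects. Record and field names are made up for this example.
import json

schema = {
    "type": "record",
    "name": "LogEvent",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "level", "type": "string"},
        {"name": "message", "type": "string"},
    ],
}

# An Avro library would parse this JSON in order to encode and
# decode records that match the schema.
schema_json = json.dumps(schema)
```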


Behavioural Analytics – analytics that informs about the how, why and what instead of just the who and when. It looks at humanized patterns in the data

Big data – an all-encompassing term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data-processing applications

Big Data Analytics – the strategy of analysing large volumes of data gathered from a wide variety of sources, including social networks, videos, digital images, sensors and sales transaction records. The aim in analysing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through this insight, businesses may be able to gain an edge over their rivals and make superior business decisions.
Big Data Scientist – someone who is able to develop the algorithms to make sense out of big data
Big data start-up – a young company that has developed new big data technology
Biometrics – the identification of humans by their characteristics
Brontobytes – approximately 1,000 yottabytes, and the size of the digital universe of tomorrow. A brontobyte is a 1 followed by 27 zeros
Business Intelligence – Business intelligence (BI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimise decisions and performance.


Cascading – Cascading provides a higher level of abstraction for Hadoop, allowing developers to create complex jobs quickly and easily in several different languages that run on the JVM, including Ruby, Scala and more. In effect, this has shattered the skills barrier, enabling companies such as Twitter to use Hadoop more broadly.

Call Detail Record (CDR) analysis – CDRs contain data that a telecommunications company collects about phone calls, such as time and length of call. This data can be used in any number of analytical applications.

Cassandra – a distributed, open-source database designed to handle large amounts of data across commodity servers while providing a highly available service. It is a NoSQL solution initially developed by Facebook, structured as a wide-column store built on key-value pairs.

Chukwa – a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop distributed file system (HDFS) and MapReduce framework, and inherits Hadoop’s scalability and robustness.  Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analysing results, in order to make the best use of the collected data.

Classification analysis – a systematic process for obtaining important and relevant information about data (also called metadata: data about data).
Clojure – Clojure is a dynamic programming language based on LISP that uses the Java Virtual Machine (JVM). It is well suited for parallel data processing.

Clickstream Analytics – The analysis of users’ Web activity through the items they click on a page.

Cloud computing – a distributed computing system over a network used for storing data off-premises
Clustering analysis – the process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data.
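As a sketch, k-means – one common clustering algorithm – can be written in a few lines for one-dimensional data (the initialisation and example points are simplified for illustration):

```python
# Minimal 1-D k-means sketch; illustrative, not production code.
def kmeans_1d(points, k=2, iters=20):
    centroids = sorted(points)[:k]  # naive init: first k sorted values
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 2, 3, 50, 51, 52], k=2)
print(centroids)  # [2.0, 51.0]
```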
Cold data storage – storing old data that is hardly used on low-power servers. Retrieving the data will take longer
Columnar database or column-oriented database – A database that stores data by column rather than by row. In a row-based database, a row might contain a name, address, and phone number. In a column-oriented database, all names are in one column, addresses in another, and so on. A key advantage of a columnar database is faster hard disk access.
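The difference between the two layouts can be sketched with plain Python structures (the example records are invented):

```python
# Row-oriented vs column-oriented layout of the same small table,
# shown with plain Python structures. Example data is made up.
rows = [
    {"name": "Ada",  "city": "London",    "phone": "555-0100"},
    {"name": "Alan", "city": "Bletchley", "phone": "555-0101"},
]

# Column-oriented: one array per attribute.
columns = {
    "name":  ["Ada", "Alan"],
    "city":  ["London", "Bletchley"],
    "phone": ["555-0100", "555-0101"],
}

# Scanning one attribute touches a single contiguous array
# rather than every row - the source of the disk-access advantage.
all_cities = columns["city"]
```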

Comparators – Two ways you may compare your keys are by implementing the WritableComparable interface or by implementing the RawComparator interface. In the former approach, you compare (deserialized) objects; in the latter, you compare the keys using their corresponding raw bytes.
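The idea behind raw-byte comparison can be sketched in Python (an analogy, not Hadoop’s actual API): with a fixed-width big-endian encoding, keys sort the same way as their bytes, so no deserialization is needed.

```python
# Comparing deserialized objects vs comparing raw serialized bytes.
# The byte-level compare works because big-endian encodings of
# non-negative integers sort the same as the integers themselves.
import struct

def serialize(n):
    return struct.pack(">I", n)  # 4-byte big-endian unsigned int

def compare_objects(a, b):
    return (a > b) - (a < b)

def compare_raw(raw_a, raw_b):
    # no deserialization: lexicographic byte order matches numeric order
    return (raw_a > raw_b) - (raw_a < raw_b)

print(compare_objects(3, 7), compare_raw(serialize(3), serialize(7)))
```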

Comparative analysis – a step-by-step procedure of comparisons and calculations to detect patterns within very large data sets.
Complex event processing (CEP) – CEP is the process of monitoring and analyzing all events across an organization’s systems and acting on them when necessary in real time.

Complex structured data – data that are composed of two or more complex, complicated, and interrelated parts that cannot be easily interpreted by structured query languages and tools.
Computer generated data – data generated by computers such as log files
Confabulation – The act of making an intuition-based decision appear to be data-based.

Concurrency – performing and executing multiple tasks and processes at the same time

Connection Analytics (sometimes called “Link Analysis”) – Connection analytics is an emerging discipline that helps to discover interrelated connections and influences between people, products, processes, machines and systems within a network by mapping those connections and continuously monitoring interactions between them. It has been used to address difficult and persistent business questions relating to, for instance, the influence of thought leaders, the impact of external events or players on financial risk, and the causal relationships between nodes in assessing network performance.
Correlation analysis – the analysis of data to determine a relationship between variables and whether that relationship is negative (-1.00) or positive (+1.00).
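A sketch of the Pearson correlation coefficient, the most common such measure, computed from first principles:

```python
# Pearson correlation coefficient from first principles; returns a
# value between -1.0 (perfectly negative) and +1.0 (perfectly positive).
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly positive: 1.0
```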

Cross-channel analytics – analysis that can attribute sales, show average order value, or track the lifetime value of a customer across channels.
Customer Relationship Management (CRM) – managing the sales and business processes; big data will affect CRM strategies


Dark Data – This is information that is gathered and processed by a business, but never put to real use. Instead, it sits in the dark waiting to be analyzed. Companies tend to have a lot of this data lying around without even realizing it.

Data access – The act or method of viewing or retrieving stored data.

Dashboard – a graphical representation of the analyses performed by the algorithms
Data aggregation – The act of collecting data from multiple sources for the purpose of reporting or analysis.

Data aggregation tools – tools that transform scattered data from numerous sources into a single new source.

Database administrator (DBA) – A person, often certified, who is responsible for supporting and maintaining the integrity of the structure and content of a database.
Data analyst – someone analysing, modelling, cleaning or processing data
Data analytics – The science of examining data with software-based queries and algorithms with the goal of drawing conclusions about that information for business decision making.

Data architecture and design – How enterprise data is structured. The actual structure or design varies depending on the eventual end result required. Data architecture has three stages or processes: the conceptual representation of business entities, the logical representation of the relationships among those entities, and the physical construction of the system to support the functionality.

Database – A database is an organized collection of data. It may include charts, schemas or tables.
Database-as-a-Service – a database hosted in the cloud on a pay per use basis, for example Amazon Web Services
Database Management System (DBMS) – software for collecting, storing and providing access to data
Data centre – a physical location that houses the servers for storing data

Data collection – Any process that captures any type of data.
Data cleansing – the process of reviewing and revising data in order to delete duplicates, correct errors and provide consistency
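A toy cleansing pass might normalise case and whitespace and then drop duplicates (the rules here are illustrative; real cleansing is domain-specific):

```python
# Toy data cleansing: normalize case and whitespace, drop blanks and
# duplicates, preserve first-seen order. Rules are illustrative only.
def cleanse(records):
    seen, out = set(), []
    for r in records:
        key = r.strip().lower()
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out

print(cleanse(["Alice", " alice ", "BOB", "", "bob"]))  # ['alice', 'bob']
```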
Data custodian – someone who is responsible for the technical environment necessary for data storage
Data-directed decision making – Using data to support making crucial decisions.

Data ethical guidelines – guidelines that help organizations be transparent with their data, ensuring simplicity, security and privacy
Data exhaust – The data that a person creates as a by-product of a common activity – for example, a cell-phone call log or web search history.

Data feed – a stream of data such as a Twitter feed or RSS
Data governance – A set of processes or rules that ensure the integrity of the data and that data management best practices are met.

Data integration – The process of combining data from different sources and presenting it in a single view.

Data integrity – The measure of trust an organization has in the accuracy, completeness, timeliness, and validity of the data.

Data Lake – a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Data mart – The access layer of a data warehouse used to provide data to users.

Data marketplace – an online environment to buy and sell data sets
Data migration – The process of moving data between different storage types or formats, or between different computer systems.

Data mining – the process of finding certain patterns or information from data sets
Data modelling – the analysis of data objects using data modelling techniques to create insights from the data

Data point – An individual item on a graph or a chart.

Data profiling – The process of collecting statistics and information about data in an existing source.

Data quality – The measure of data to determine its worthiness for decision making, planning, or operations.

Data replication – The process of sharing information to ensure consistency between redundant sources.

Data repository – The location of permanently stored data.

Data science – A recent term that has multiple definitions, but generally accepted as a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.

Data scientist – A practitioner of data science.
Data security – The practice of protecting data from destruction or unauthorized access.

Data set – a collection of data

Data source – Any provider of data – for example, a database or a data stream.

Data steward – A person responsible for data stored in a data field.

Data structure – A specific way of storing and organizing data.
Data virtualization – a data integration process used to gain more insights. Usually it involves databases, applications, file systems, websites, big data techniques, etc.
Data visualization – A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.

Data warehouse – A place to store data for the purpose of reporting and analysis.

De-identification – same as anonymization; ensuring a person cannot be identified through the data
Demographic data – Data relating to the characteristics of a human population.

Deep Thunder – IBM’s weather prediction service that provides weather data to organizations such as utilities, which use the data to optimize energy distribution.

Descriptive Analytics – Considered the most basic type of analytics, descriptive analytics involves the breaking down of big data into smaller chunks of usable information so that companies can understand what happened with a specific operation, process or set of transactions. Descriptive analytics can provide insight into current customer behaviours and operational trends to support decisions about resource allocations, process improvements and overall performance management. Most industry observers believe it represents the vast majority of the analytics in use at companies today.

Discriminant analysis – cataloguing of the data; distributing data into groups, classes or categories. A statistical analysis used where certain groups or clusters in the data are known upfront, and which uses that information to derive the classification rule.
Distributed cache – A data cache that is spread across multiple systems but works as one. It is used to improve performance.

Distributed object – A software module designed to work with other distributed objects stored on other computers.

Distributed processing – The execution of a process across multiple computers connected by a computer network.

Distributed File System – a system that offers simplified, highly available access to storing, analysing and processing data
Document Store Databases – a document-oriented database that is especially designed to store, manage and retrieve documents, also known as semi-structured data.

Document management – The practice of tracking and storing electronic documents and scanned images of paper documents.

Drill – An open source distributed system for performing interactive analysis on large-scale datasets. It is similar to Google’s Dremel, and is managed by Apache.


Elasticsearch – An open source search engine built on Apache Lucene.

Event analytics – Shows the series of steps that led to an action.

Exploratory analysis – finding patterns within data without standard procedures or methods. It is a means of discovering the data and finding the data set’s main characteristics.
Exabytes – approximately 1000 petabytes or 1 billion gigabytes. Today we create one Exabyte of new information globally on a daily basis.

External data – Data that exists outside of a system.

Extract, Transform and Load (ETL) – a process in databases and data warehousing that extracts data from various sources, transforms it to fit operational needs, and loads it into the target database or warehouse
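A minimal ETL sketch, with invented field names and a plain list standing in for the target warehouse:

```python
# Minimal ETL sketch: extract rows from a CSV string, transform them
# (type conversion, filtering), load into an in-memory "warehouse".
# Field names and data are invented for the example.
import csv
import io

raw = "name,amount\nwidget,10\ngadget,-3\nsprocket,25\n"

def extract(source):
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    # convert amount to int and drop invalid (negative) records
    return [{"name": r["name"], "amount": int(r["amount"])}
            for r in rows if int(r["amount"]) >= 0]

warehouse = []

def load(rows):
    warehouse.extend(rows)

load(transform(extract(raw)))
print(warehouse)
```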


Failover – switching automatically to a different server or node should one fail
Fault-tolerant design – a system designed to continue working even if certain parts fail

Flume – a framework for populating Hadoop with data. Agents are populated throughout one’s IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.


Gamification – using game elements in a non-game context; very useful for creating data, and therefore coined the friendly scout of big data
Graph Databases – databases that use graph structures (a finite set of ordered pairs or certain entities), with edges, properties and nodes for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighbouring elements.
Grid computing – connecting different computer systems from various locations, often via the cloud, to reach a common goal


Hadoop – an open-source framework built to enable the processing and storage of big data across a distributed file system

Hama – Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms. It’s a Top-Level Project under the Apache Software Foundation.

HANA – A software/hardware in-memory computing platform from SAP designed for high-volume transactions and real-time analytics.
HBase – an open source, non-relational, distributed database running in conjunction with Hadoop
HCatalog – HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.

HDFS – Hadoop Distributed File System; a distributed file system designed to run on commodity hardware
High-Performance-Computing (HPC) – using supercomputers to solve highly complex and advanced computing problems

Hive – Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as Microstrategy, Tableau, Revolutions Analytics, etc.

Hue – Hue (Hadoop User Experience) is an open source web-based interface for making it easier to use Apache Hadoop. It features a file browser for HDFS, an Oozie Application for creating workflows and coordinators, a job designer/browser for MapReduce, a Hive and Impala UI, a Shell, a collection of Hadoop API and more.


In-database analytics – The integration of data analytics into the data warehouse.

In-memory data grid (IMDG) – The storage of data in memory across multiple servers for the purpose of greater scalability and faster access or analytics.

In-memory database – a database management system that stores data in main memory instead of on disk, resulting in very fast processing, storing and loading of the data

Internet of Things – ordinary devices that are connected to the internet anytime, anywhere via sensors


Juridical data compliance – relevant when you use cloud solutions where the data is stored in a different country or continent. Be aware that data stored in a different country has to comply with the laws of that country.


Kafka – a distributed publish-subscribe messaging system, developed by LinkedIn, that offers a solution capable of handling all data-flow activity and processing this data on a consumer website. This type of data (page views, searches and other user actions) is a key ingredient in the current social web.

Key Value Stores – Key value stores allow the application to store its data in a schema-less way. The data could be stored in a datatype of a programming language or an object. Because of this, there is no need for a fixed data model.

KeyValue Databases – they store data with a primary key, a uniquely identifiable record, which makes lookups easy and fast. The data stored in a key-value database is normally some kind of primitive of the programming language.
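The key-value model is simple enough to sketch with a Python dict standing in for the store (real systems add persistence, replication and partitioning; the keys and values below are invented):

```python
# A key-value store in miniature: a dict keyed by a unique primary
# key, with schema-less values (any object the application chooses).
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put("user:42", {"name": "Ada", "last_login": "2015-01-01"})
put("counter:hits", 1024)  # values need not share a schema

print(get("user:42")["name"])  # Ada
```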


Latency – a measure of time delayed in a system
Legacy system – An established computer system, application, or technology that continues to be used because of the value it provides to the enterprise.

Linked data – As described by World Wide Web inventor Tim Berners-Lee, “Cherry-picking common attributes or languages to identify connections or relationships between disparate sources of data.”
Load balancing – distributing workload across multiple computers or servers in order to achieve optimal results and utilization of the system
Location analytics – Location analytics brings mapping and map-driven analytics to enterprise business systems and data warehouses. It allows you to associate geospatial information with datasets.

Location data – GPS data describing a geographical location
Log file – a file automatically created by a computer to record events that occur while operational


Machine-generated data – Any data that is automatically created from a computer process, application, or other non-human source.

Machine2Machine data – data exchanged between two or more machines that are communicating with each other
Machine data – data created by machines via sensors or algorithms
Machine learning – part of artificial intelligence where machines learn from what they are doing and become better over time
Mahout – a data mining library. It takes the most popular data mining algorithms for performing clustering, regression and statistical modelling and implements them using the MapReduce model.

MapReduce – A big data batch processing framework that breaks up a data analysis problem into pieces that are then mapped and distributed across multiple computers on the same network or cluster, or across a grid of disparate and possibly geographically separated systems. The data analytics performed on this data are then collected and combined into a distilled or “reduced” report.
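The canonical MapReduce example is word count; a single-process sketch of the map, shuffle and reduce phases:

```python
# Word count, the canonical MapReduce example, sketched in-process:
# map emits (word, 1) pairs, shuffle groups them by key, reduce sums.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

In a real cluster the map and reduce phases run on many machines, and the shuffle moves intermediate pairs across the network; the logic, however, is exactly this.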
Mashup – The process of combining different datasets within a single application to enhance output – for example, combining demographic data with real estate listings.

Massively Parallel Processing (MPP) – using many different processors (or computers) to perform certain computational tasks at the same time
Metadata – data about data; gives information about what the data is about.
MongoDB – an open-source NoSQL database

MPP database – A database optimized to work in a massively parallel processing environment.
Multi-Dimensional Databases – a database optimized for online analytical processing (OLAP) applications and for data warehousing.
MultiValue Databases – a type of NoSQL, multidimensional database that understands 3-dimensional data directly. They are primarily giant strings that are perfect for manipulating HTML and XML strings directly

Multivariate analysis – essentially the statistical process of simultaneously analysing multiple independent (or predictor) variables with multiple dependent (outcome or criterion) variables using matrix algebra (most multivariate analyses are correlational).


Natural Language Processing – a field of computer science involved with interactions between computers and human languages
Network analysis – viewing relationships among the nodes in terms of the network or graph theory, meaning analysing connections between nodes in a network and the strength of the ties.
Neural Network – Artificial Neural Networks are models inspired by the real-life biology of the brain. These are used to estimate mathematical functions and facilitate different kinds of learning algorithms. Deep Learning is a similar term, and is generally seen as a modern buzzword, rebranding the Neural Network paradigm for the modern day.

NewSQL – a class of modern relational database systems that aim to combine the familiar SQL interface and guarantees of traditional databases with the scalability of NoSQL systems. The term is even newer than NoSQL
NoSQL – sometimes referred to as ‘Not only SQL’; a class of databases that do not adhere to traditional relational database structures. They often relax strict consistency in order to achieve higher availability and horizontal scaling.


Object Databases – they store data in the form of objects, as used by object-oriented programming. They are different from relational or graph databases and most of them offer a query language that allows objects to be found with a declarative programming approach.
Object-based Image Analysis – analysing digital images can be performed with data from individual pixels, whereas object-based image analysis uses data from a selection of related pixels, called objects or image objects.

Online analytical processing (OLAP) – The process of analysing multidimensional data using three operations: consolidation (the aggregation of available data), drill-down (the ability for users to see the underlying details), and slice and dice (the ability for users to select subsets and view them from different perspectives).

Online transactional processing (OLTP) – The process of providing users with access to large amounts of transactional data in a way that they can derive meaning from it.

Oozie – a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.

OpenDremel – The open source version of Google’s BigQuery Java code. It is being integrated with Apache Drill.

Open Data Center Alliance (ODCA) – A consortium of global IT organizations whose goal is to speed the migration to cloud computing.
Operational Databases – they carry out regular operations of an organisation and are generally very important to a business. They generally use online transaction processing that allows them to enter, collect and retrieve specific information about the company.
Optimization analysis – the process of optimization during the design cycle of products done by algorithms. It allows companies to virtually design many different variations of a product and to test that product against pre-set variables.
Ontology – a representation of knowledge as a set of concepts within a domain and the relationships between those concepts
Outlier detection – an outlier is an object that deviates significantly from the general average within a dataset or a combination of data. It is numerically distant from the rest of the data, and therefore indicates that something unusual is going on and generally requires additional analysis.
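One common distribution-free approach is the interquartile range (IQR) rule, sketched here with crude quartile estimates (the 1.5 multiplier is the conventional choice; the data is invented):

```python
# Outlier detection via the IQR rule: values more than 1.5 * IQR
# outside the middle 50% of the data are flagged. Quartile estimates
# here are crude index-based ones, fine for a sketch.
def iqr_outliers(values):
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 10, 99]))  # [99]
```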


Parallel data analysis – Breaking up an analytical problem into smaller components and running algorithms on each of those components at the same time. Parallel data analysis can occur within the same system or across multiple systems.

Parallel method invocation (PMI) – Allows programming code to call multiple functions in parallel.

Parallel processing – The ability to execute multiple tasks at the same time.

Parallel query – A query that is executed over multiple system threads for faster performance.
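The ideas above can be sketched with Python threads: split the data into chunks, process each chunk concurrently, then combine the partial results (chunk size and data are arbitrary for the example).

```python
# Parallel aggregation sketch: sum four chunks of a dataset on a
# thread pool, then combine the partial sums into one result.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(sum, chunks))  # each chunk summed on its own thread

print(total)  # 5050
```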

Pattern Recognition – identifying patterns in data via algorithms to make predictions of new data coming from the same source.

Pentaho – Pentaho offers a suite of open source Business Intelligence (BI) products called Pentaho Business Analytics, providing data integration, OLAP services, reporting, dashboarding, data mining and ETL capabilities.
Petabytes – approximately 1,000 terabytes or 1 million gigabytes. The CERN Large Hadron Collider generates approximately 1 petabyte per second

Pig – a Hadoop-based data-flow platform developed by Yahoo. Its language, Pig Latin, is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
Platform-as-a-Service – a service providing all the necessary infrastructure for cloud computing solutions
Predictive analytics – the most valuable analysis within big data, as it helps predict what someone is likely to buy, visit or do, or how someone will behave in the (near) future. It uses a variety of different data sets such as historical, transactional, social or customer profile data to identify risks and opportunities.

Predictive modelling – The process of developing a model that will most likely predict a trend or outcome.

Prescriptive Analytics – A type or extension of predictive analytics, prescriptive analytics is used to recommend or prescribe specific actions when certain information states are reached or conditions are met. It uses algorithms, mathematical techniques and/or business rules to choose from amongst several different actions that are aligned to an objective (such as improving business performance) and that recognise various requirements or constraints.
Privacy – to seclude certain data / information about oneself that is deemed personal
Public data – public information or data sets that were created with public funding


Quantified Self – a movement to use applications to track one’s every move during the day in order to gain a better understanding of one’s behaviour
Query – asking for information to answer a certain question

Query analysis – The process of analysing a search query for the purpose of optimizing it for the best possible result.


R – R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.

Re-identification – combining several data sets to find a certain person within anonymized data

Reference data – Data that describes an object and its properties. The object may be physical or virtual.
Regression analysis – a technique to define the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable.
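For a single predictor, the least-squares fit can be computed from first principles (the data points are invented and chosen to lie exactly on a line):

```python
# Simple linear regression (ordinary least squares) from first
# principles: fit y = a + b*x, with x the predictor, y the response.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))  # slope
    a = my - b * mx                          # intercept
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 1 + 2x
print(a, b)  # 1.0 2.0
```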
RFID – Radio Frequency Identification; a type of sensor using wireless non-contact radio-frequency electromagnetic fields to transfer data
Real-time data – data that is created, processed, stored, analysed and visualized within milliseconds
Recommendation engine – an algorithm that suggests certain products based on previous buying behaviour or buying behaviour of others

Risk analysis – The application of statistical methods on one or more datasets to determine the likely risk of a project, action, or decision.

Root-cause analysis – The process of determining the main cause of an event or problem.
Routing analysis – finding the optimal route, using many different variables, for a certain means of transport in order to decrease fuel costs and increase efficiency.


Scalability – The ability of a system or process to maintain acceptable performance levels as workload or scope increases.

Schema – The structure that defines the organization of data in a database system.

Search data – Aggregated data about search terms used over time.

Semi-structured data – a form of data that does not conform to the formal structure of structured data. It does, however, have tags or other markers to separate elements and enforce a hierarchy of records.
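JSON is a common semi-structured format: records share tags (field names) but not a rigid schema. A small illustration with invented records and Python’s standard library:

```python
# Semi-structured data sketch: the second JSON record carries an extra
# field the first lacks, yet both can be read by tag. Data is invented.
import json

raw = '''
[
  {"name": "Alice", "orders": 3},
  {"name": "Bob", "orders": 1, "loyalty_tier": "gold"}
]
'''

records = json.loads(raw)
# Fields are accessed by tag even though the records differ in shape.
tiers = [r.get("loyalty_tier", "none") for r in records]
```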
Sentiment Analysis – using algorithms to find out how people feel about certain topics
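At its simplest, this can be a lexicon lookup. The tiny word lists below are invented for illustration; real sentiment analysis uses far larger lexicons or trained models.

```python
# A minimal lexicon-based sentiment scorer: count positive and negative
# words from a tiny hand-made word list. Illustrative only.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "awful", "hate", "poor"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```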

Server – A physical or virtual computer that serves requests for a software application and delivers the responses over a network.
Signal analysis – the analysis of measurements of time-varying or spatially varying physical quantities to assess the performance of a product. Especially used with sensor data.
Similarity searches – finding the closest object to a query in a database, where the data object can be of any type of data.
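For numeric data, a minimal sketch is nearest-neighbour search by Euclidean distance; the toy “database” of vectors below is invented for illustration.

```python
# A minimal similarity-search sketch: find the stored vector closest to
# a query vector by Euclidean distance. Vectors are invented.
import math

database = {
    "a": (0.0, 0.0),
    "b": (3.0, 4.0),
    "c": (1.0, 1.0),
}

def nearest(query):
    """Return the key of the database vector closest to `query`."""
    return min(database, key=lambda k: math.dist(database[k], query))
```

Real systems use index structures (k-d trees, locality-sensitive hashing) to avoid scanning every object.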
Simulation analysis – a simulation is the imitation of the operation of a real-world process or system. A simulation analysis helps to ensure optimal product performance taking into account many different variables.
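A classic illustration of the simulation idea is a Monte Carlo estimate of pi: sample random points and count how many satisfy a condition. Real simulation analyses model domain processes rather than geometry, so this is a sketch of the mechanism only.

```python
# Monte Carlo simulation sketch: estimate pi from random points in the
# unit square that fall inside the quarter circle. Illustrative only.
import random

random.seed(42)  # fixed seed so the run is repeatable
n = 100_000
inside = sum(random.random() ** 2 + random.random() ** 2 <= 1.0 for _ in range(n))
pi_estimate = 4 * inside / n
```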
Smart grid – refers to using sensors within an energy grid to monitor what is going on in real-time helping to increase efficiency
Software-as-a-Service (SaaS) – a software tool that is used over the web via a browser

Social network analysis [SNA] – is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities. The nodes in the network are the people and groups while the links show relationships or flows between the nodes.
Spatial analysis – refers to analysing spatial data, such as geographic data or topological data, to identify and understand patterns and regularities within data distributed in geographic space.

Spark (Apache Spark) – An open-source computing framework originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. Spark is mostly used for machine learning and interactive analytics.
Structured Query Language (SQL) – A programming language designed specifically to manage and retrieve data from a relational database system.
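A short hedged example of SQL in action, using Python’s built-in sqlite3 module with an in-memory database; the table and figures are invented for illustration.

```python
# SQL sketch using the standard-library sqlite3 module: create a table,
# insert rows, and run an aggregate query. Data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 15.0), ("alice", 20.0)],
)

# A typical SQL query: total order value per customer.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```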

Sqoop – Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

Storage – Any means of storing data persistently.

Storm – Storm is a free, open-source system for real-time distributed computing, originally developed at Twitter. Storm makes it easy to reliably process unbounded streams of data in real time, doing for real-time processing what Hadoop did for batch processing.
Structured data – data that is identifiable because it is organized in a structure such as rows and columns. The data resides in fixed fields within a record or file, or the data is tagged correctly and can be accurately identified.

System of record (SOR) data – Data that is typically found in fixed record lengths, with at least one field in the data record serving as a data key or access field. System of records data makes up company transaction files, such as orders that are entered, parts that are shipped, bills that are sent, and records of customer names and addresses.


Text analytics – The application of statistical, linguistic, and machine learning techniques on text-based sources to derive meaning or insight.
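A minimal sketch of the first step, term frequency, using only the standard library; real text analytics adds tokenisation rules, stemming, stop-word removal and statistical models on top.

```python
# Text-analytics sketch: count term frequencies in a snippet of text
# with the standard library. The sample sentence is invented.
from collections import Counter
import re

text = "Big data is big. Big data needs big storage."
tokens = re.findall(r"[a-z]+", text.lower())
frequencies = Counter(tokens)
```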

Terabytes – approximately 1000 gigabytes. A terabyte can store up to 300 hours of high-definition video

Thrift – “Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.”
Time series analysis – analysing well-defined data obtained through repeated measurements over time. The data has to be well defined and measured at successive points in time, spaced at identical intervals.
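A common first step is smoothing the series with a moving average; the daily readings below are invented for illustration.

```python
# Time-series sketch: smooth equally spaced measurements with a simple
# moving average. The temperature readings are invented.

def moving_average(series, window):
    """Average each run of `window` consecutive points."""
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]

daily_temps = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
smoothed = moving_average(daily_temps, 3)
```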
Topological Data Analysis – focusing on the shape of complex data and identifying clusters and any statistical significance that is present within that data.
Transactional data – dynamic data that changes over time
Transparency – consumers want to know what happens with their data, and organizations have to be transparent about that


Unstructured data – Data that has no identifiable structure – for example, the text of email messages.


Value – the benefit that organizations, societies and consumers derive from all the available data. Big data means big business, and every industry can reap benefits from it.
Variability – the meaning of data can change, sometimes rapidly. In (almost) identical tweets, for example, the same word can have a totally different meaning
Variety – data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data
Velocity – the speed at which the data is created, stored, analysed and visualized
Veracity – organizations need to ensure that the data is correct as well as the analyses performed on the data are correct. Veracity refers to the correctness of the data
Visualization – with the right visualizations, raw data can be put to use. Visualizations of course do not mean ordinary graphs or pie-charts. They mean complex graphs that can include many variables of data while still remaining understandable and readable
Volume – the amount of data, ranging from megabytes to brontobytes


Weather data – an important open public data source that can provide organisations with a lot of insights if combined with other sources


XML Databases – XML Databases allow data to be stored in XML format. XML databases are often linked to document-oriented databases. The data stored in an XML database can be queried, exported and serialized into any format needed.


Yottabytes – approximately 1000 zettabytes, or 250 trillion DVDs. The entire digital universe today is estimated at around 1 yottabyte, and this is expected to double roughly every 18 months.


Zettabytes – approximately 1000 exabytes, or 1 billion terabytes. It is expected that by 2016 more than 1 zettabyte will cross our networks globally on a daily basis.

ZooKeeper – ZooKeeper is a software project of the Apache Software Foundation: a service that provides centralized configuration management and naming for large distributed systems. ZooKeeper is a subproject of Hadoop.