Monday, 17 March 2008

Big data Neeraj Nathani SmartBridge


Big data
Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."[5][6][7]
As of 2012[update], limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.[8][9] Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics,[10] connectomics, complex physics simulations,[11] and biological and environmental research.[12] The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.[13][14] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[15] as of 2012[update], every day 2.5 quintillion (2.5×10^18) bytes of data were created.[16] The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.[17]
Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers".[18] What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."[19]

Definition

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012[update] ranging from a few dozen terabytes to many petabytes of data in a single data set. Because of this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data.
In a 2001 research report[20] and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data.[21] In 2012, Gartner updated its definition as follows: "Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."[22]

Examples

Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, forecasting drive times for new home buyers, medical records, photography archives, video archives, and large-scale e-commerce.

Big science

The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.999% of these streams, there are 100 collisions of interest per second.[23][24][25]
  • As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents a 25-petabyte annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.
  • If all sensor data were recorded at the LHC, the data flow would be extremely hard to work with: it would exceed a 150-million-petabyte annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times higher than all the other sources combined in the world (the unit conversion is sketched below).
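
As a rough check on these figures, the short Python sketch below converts the quoted 150-million-petabyte annual rate into a per-day byte count; the annual figure is taken from the text above, and the rest is plain unit arithmetic.

    # Back-of-the-envelope check of the LHC data-rate figures quoted above.
    PETABYTE = 10 ** 15   # bytes (decimal convention)
    EXABYTE = 10 ** 18    # bytes

    annual_rate_pb = 150e6                        # "150 million petabytes annual rate"
    daily_bytes = annual_rate_pb * PETABYTE / 365

    print(f"{daily_bytes / EXABYTE:.0f} exabytes per day")   # ~411 EB/day, i.e. nearly 500 EB
    print(f"{daily_bytes:.1e} bytes per day")                # ~4.1e+20, on the order of 5×10^20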

Science and research

  • When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016 it is anticipated to acquire that amount of data every five days.[5]
  • Decoding the human genome originally took 10 years to process; now it can be achieved in one week.[5]
  • Computational social science — Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviour and real-world economic indicators.[26][27][28] The authors of the study examined Google query logs made by Internet users in 45 different countries in 2010 and calculated the ratio of the volume of searches for the coming year (‘2011’) to the volume of searches for the previous year (‘2009’), which they call the ‘future orientation index’ (a minimal computation of this index is sketched below).[29] They compared the future orientation index to the per capita GDP of each country and found a strong tendency for countries in which Google users enquire more about the future to exhibit a higher GDP. The results suggest a possible relationship between the economic success of a country and the information-seeking behavior of its citizens as captured in big data.
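
A minimal sketch of how such a ratio could be computed from search-volume counts follows; the countries and volumes below are invented for illustration and are not the study's data.

    import pandas as pd

    # Hypothetical Google search volumes per country for the year labels
    # "2011" (coming year) and "2009" (previous year); the numbers are made up.
    volumes = pd.DataFrame({
        "country":       ["A", "B", "C"],
        "searches_2011": [120_000, 45_000, 80_000],
        "searches_2009": [100_000, 60_000, 80_000],
    })

    # Future orientation index: forward-looking search volume divided by
    # backward-looking search volume, as described above.
    volumes["future_orientation_index"] = (
        volumes["searches_2011"] / volumes["searches_2009"]
    )
    print(volumes.sort_values("future_orientation_index", ascending=False))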

Private sector

  • Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and as of 2005 they had the world’s three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.[37]
  • Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data – the equivalent of 167 times the information contained in all the books in the US Library of Congress.[5]
  • Facebook handles 50 billion photos from its user base.
  • FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.[38]
  • The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.[39]
  • Infosys has also launched BigDataEdge to analyse big data.[40][41]
  • Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work at various times of the day.[42]

International development

Following decades of work in the area of the effective usage of information and communication technologies for development (or ICT4D), it has been suggested that Big Data can make important contributions to international development.[43][44] On the one hand, the advent of Big Data delivers the cost-effective prospect to improve decision-making in critical development areas such as health care, employment, economic productivity, crime and security, and natural disaster and resource management.[45] On the other hand, all the well-known concerns of the Big Data debate, such as privacy, interoperability challenges, and the almighty power of imperfect algorithms, are aggravated in developing countries by long-standing development challenges such as a lack of technological infrastructure and scarce economic and human resources. "This has the potential to result in a new kind of digital divide: a divide in data-based intelligence to inform decision-making."[45]

Technologies

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report[49] suggests suitable technologies include A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modelling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualisation. Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation,[50] such as multilinear subspace learning.[51] Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud based infrastructure (applications, storage and computing resources) and the Internet.[citation needed]
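
To make the tensor idea concrete, the sketch below builds a small (weeks × stores × products) sales tensor with NumPy and unfolds it into a matrix, the basic reshaping that multilinear subspace methods build on; the dimensions and values are invented for the example.

    import numpy as np

    # Toy sales tensor: 4 weeks x 3 stores x 5 products (values are invented).
    rng = np.random.default_rng(0)
    sales = rng.poisson(lam=20, size=(4, 3, 5))

    # Mode-1 unfolding: reshape the tensor into a (weeks x store*product) matrix.
    mode1 = sales.reshape(sales.shape[0], -1)

    # A truncated SVD of the unfolding keeps the dominant structure, the kind of
    # low-rank view that tensor-based methods exploit.
    u, s, vt = np.linalg.svd(mode1, full_matrices=False)
    rank2_approx = (u[:, :2] * s[:2]) @ vt[:2, :]
    print(mode1.shape, rank2_approx.shape)
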
Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.[52]
DARPA’s Topological Data Analysis program seeks the fundamental structure of massive data sets and in 2008 the technology went public with the launch of a company called Ayasdi.
The practitioners of big data analytics processes are generally hostile to slower shared storage[citation needed], preferring direct-attached storage (DAS) in its various forms from solid state disk (SSD) to high capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures—SAN and NAS—is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
Real or near-real time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good—data on spinning disk at the other end of a FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than other storage techniques.
There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011[update] did not favour it.[53]

Source: Wikipedia.

Friday, 14 March 2008

Neeraj Nathani SmartBridge Trading Solutions Pvt Ltd



The Big Value In Big Data: Seeing Customer Buying Patterns

Is Big Data simply a popular catchphrase or the launch of a new era? Overuse doesn’t automatically transform a buzzword into a best business practice, and several factors, such as measurable results, focus and sustainability, determine whether an idea is much more than just that. The core question is this: does big data actually solve real-world business problems? The short answer is yes – and here’s a real-world example of how leveraging Big Data can solve the complexity around product proliferation by helping companies align their product offering and supply chain with customer-buying patterns.

Big Data: More Than Just A Trend
According to Gartner, unstructured and structured data held by enterprises continues to grow at explosive rates. However, volume and velocity of data – what the business world is beginning to understand as the “Big Data Problem” – are becoming less of an issue than the variety of data. Each silo within the enterprise – operations, supply management, sales, marketing – faces its own data variety challenges, where bits exist in a multitude of formats and types.

Due to the variability of data across silos, systems can’t “speak” to one another, and gaining an accurate, enterprise-wide view of demand and performance seems impossible. In fact, most business and IT managers accept the lack of intersystem collaboration as a given, an inevitable limit that must be worked around. As a result, what we know is being increasingly outpaced by the things we don’t know. Performance within individual silos is clear, but this view does little to inform effective strategic direction for the organization.
There is a better way to tackle this variety challenge and capture the opportunity that Big Data presents.
Patterns And Connections
Traditionally, business silos have individually sorted, stored and managed the data relevant to their own needs. This has resulted in enterprise-related data being fragmented into disconnected pieces across the silos. The Big Data challenge requires aggregating the data across these silos for a single view of the organization. The purpose of the aggregation is to reveal the intelligence and causality across business functions for strategic insight. This challenge requires solutions that can harness the data and deliver actionable intelligence to the business user. Conventional business intelligence and data warehouse tools aren’t designed to analyze, identify and surface critical data linkages and causality. As a result, insights, contexts and market opportunities remain hidden from view because users don’t know what they don’t know.
These critical connections and causalities are the key to managing big data and allowing companies to see a more comprehensive picture of their products and product variants based on actual customer-buying patterns. The connection patterns reveal valuable information quickly and accurately, allowing for faster, more relevant decision-making.
Freeing the data to reveal connections and causation through pattern-based analytics solutions will paint a bigger picture – one that can better manage product variants and streamline sales by shedding light on what customers are buying, when, where and how. Currently, companies pour their non-standard data into spreadsheets that then require teams of data analysts to interpret and derive meaning from them. This is not scalable and often misses the mark. Big data demands applications that can interpret and deliver immediate, actionable intelligence to business users.
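
As an illustration of that aggregation step, here is a minimal pandas sketch that joins extracts from two hypothetical silos (order history from sales, shipment records from the supply chain) and surfaces a simple buying pattern; every table and column name here is invented.

    import pandas as pd

    # Hypothetical extracts from two silos; in practice these would come from
    # separate systems (CRM, ERP, web logs) rather than inline literals.
    orders = pd.DataFrame({
        "order_id": [1, 2, 3, 4, 5],
        "customer": ["acme", "acme", "globex", "initech", "globex"],
        "product":  ["A-100", "A-200", "A-100", "A-300", "A-200"],
        "quantity": [10, 4, 7, 2, 5],
    })
    shipments = pd.DataFrame({
        "order_id":       [1, 2, 3, 4, 5],
        "lead_time_days": [5, 12, 6, 20, 11],
    })

    # Join the silos on a shared key to get a single cross-functional view.
    combined = orders.merge(shipments, on="order_id")

    # A simple "pattern": which products sell most, and how their delivery
    # performance looks once sales and supply chain data sit side by side.
    pattern = (combined.groupby("product")
                       .agg(total_units=("quantity", "sum"),
                            avg_lead_time=("lead_time_days", "mean"))
                       .sort_values("total_units", ascending=False))
    print(pattern)
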
Leveraging Data To Address Product Proliferation
One of the major business problems companies are facing today is the complexity caused by product proliferation – one of the biggest drivers of material cost and inventory levels. This is a complex issue that isn’t easily resolved with traditional approaches, but can be addressed with a systematic and enterprise-wide pattern-based analytics approach to leveraging the data across the silos within the organization.
For high-value manufacturing, product variety is the single biggest driver of cost, as it dictates the material requirements and inventory, which together typically account for over 80 percent of the total cost. Chasing diverse customer demands with an explosion of product variants dramatically increases total cost and causes volatility across the supply chain. But what are customers really buying? Are customers really buying all that companies are offering? Is there a way to satisfy them with fewer variations, or with alternate variations that improve supply chain efficiency? This is crucial knowledge not only for satisfying demand, but also for reducing supply chain costs and increasing margins.
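
One simple way to probe these questions is a Pareto-style cut of demand by variant. The sketch below, using invented numbers, finds the smallest set of variants that covers roughly 90 percent of demand.

    import pandas as pd

    # Invented demand by product variant (units sold over some period).
    demand = pd.Series(
        {"V1": 4200, "V2": 2600, "V3": 900, "V4": 450, "V5": 120, "V6": 30},
        name="units",
    )

    # Sort variants by demand and compute the cumulative share of total volume.
    share = demand.sort_values(ascending=False).cumsum() / demand.sum()

    # Keep variants until the cumulative share passes the 90% threshold.
    core_variants = share[share <= 0.90].index.tolist()
    if len(core_variants) < len(share):
        core_variants.append(share.index[len(core_variants)])

    print("Variants covering ~90% of demand:", core_variants)
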
As product proliferation has increased, so have operational complexity and cost structure; “what-if” scenarios with alternate product portfolios that meet the majority of customer demand have been an underutilized lever and represent the next frontier of business process improvement. A poor product mix will drive complexity throughout the value chain – impacting supply chain, marketing and sales, and service and support. In the same vein, a good product mix can dramatically improve delivery costs and increase profits.
The product portfolio also impacts sales efficiency and top-line revenue. To understand the opportunity cost on the sales side, consider the typical OEM salesperson: let’s say he or she has a $4 million quota and spends 10.5% of his or her time defining a customer solution, configuring and pricing it, and then tracking delivery. Reducing that time by just 1 percent for a 100-person sales force represents an opportunity cost of $42 million.
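
The $42 million figure is consistent with reading it as the opportunity cost of the full 10.5% of selling time across the 100-person force; that reading is an assumption, but the arithmetic under it is shown below.

    # Assumed reading: 100 salespeople, each carrying a $4M quota, each spending
    # 10.5% of their time on configure/price/track work rather than selling.
    quota_per_rep = 4_000_000
    reps = 100
    non_selling_share = 0.105

    opportunity_cost = quota_per_rep * reps * non_selling_share
    print(f"${opportunity_cost:,.0f}")   # $42,000,000
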
Pattern-based analytics solutions are already being leveraged by some companies, giving them a much-needed competitive advantage in a high-risk environment by providing insights into customer-buying patterns to guide the product offering, supply chain planning and execution. For example, when NCR needed to optimize product configurations across their ATM line, they implemented pattern-based analytics solutions to analyze customer-buying patterns. With insight into what’s selling where, to whom, when and how often, NCR optimized the product line, defined customer segments and then seamlessly pushed this to the sales team to help reduce lead times.
At NCR, this resulted in a dramatic improvement in sales efficiency and supply chain performance. Pattern-based analytics can reveal deep insight into what customers are buying and leverage that insight to guide customers to the best choices based on availability and product margin. This approach uses customer-buying patterns to create the best product offering, and simultaneously guides customers to the best choices based on what is on hand. This is a win-win as the supply chain builds what is being bought and the sales reps sell what is in stock.

Source: Wikipedia.

Thursday, 6 March 2008

Neeraj Nathani SmartBridge Trading Solutions Pvt Ltd


Trade Promotion Forecasting
Trade Promotion Forecasting (TPF) is the process that attempts to discover multiple correlations between trade promotion characteristics and historic demand in order to provide accurate demand forecasting for future campaigns. The ability to distinguish the uplift or demand due to the impact of the trade promotion as opposed to baseline demand is fundamental to model promotion behavior. Model determination enables what-if analysis to evaluate different campaign scenarios with the goal of improving promotion effectiveness and ROI at the product-channel level by selecting the best scenario.
Trade promotion forecasting challenges
Trade Promotion spending is one of the consumer goods industry’s largest expenses with costs for major manufacturers ranging from 10 percent to 20 percent of gross sales. Understandably, 67 percent of respondents to a recent survey said they were concerned about the return on investment (ROI) gained from such spending. Quantifying ROI depends heavily on the ability to accurately identify the “baseline” demand (the demand that would exist without the impact of the trade promotion) and the uplift.[1]
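
A minimal sketch of that baseline/uplift split: given an estimated baseline and the observed promoted sales, the incremental margin is compared with the promotion spend; all figures are invented.

    # Invented weekly figures for one promoted product.
    baseline_units = 1_000      # estimated demand without the promotion
    actual_units = 1_450        # observed demand during the promotion
    unit_margin = 2.50          # contribution margin per unit
    promotion_cost = 800.0      # trade spend for the week

    uplift_units = actual_units - baseline_units
    incremental_margin = uplift_units * unit_margin
    roi = (incremental_margin - promotion_cost) / promotion_cost
    print(f"uplift: {uplift_units} units, promotion ROI: {roi:.0%}")
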
In fact, forecast accuracy plays a critical role in the success of consumer goods companies. Aberdeen Group research found that best-in-class companies (with an average forecast accuracy of 72 percent) have an average promotion gross margin uplift of 28 percent, while laggard companies (with an average forecasting accuracy of only 42 percent) have a gross margin uplift of less than 7 percent.[2]
A bottom-up sales forecast at the SKU-account/POS level requires taking into account product attributes, historical sales levels and store specifics. A complicating factor is that the many variables describing the product, the store and the promotion, both quantitative and qualitative, can each take many different values. Selecting the most important variables and incorporating them into a prediction model is a challenging task.[3]
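
To illustrate the variable-selection task, the sketch below ranks a handful of synthetic product, store and promotion attributes against historical demand using a simple univariate F-test; a production model would go well beyond this, but the shape of the problem is the same.

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_regression

    # Synthetic SKU-account history with a few candidate predictors.
    rng = np.random.default_rng(1)
    n = 500
    X = pd.DataFrame({
        "list_price":    rng.uniform(2, 10, n),
        "discount_pct":  rng.uniform(0, 0.4, n),
        "display_space": rng.integers(0, 3, n),      # 0 = none, 2 = end-cap
        "store_size_m2": rng.uniform(200, 2000, n),
        "day_of_week":   rng.integers(0, 7, n),
    })
    # In this toy data, demand is driven mainly by discount and display.
    y = 50 + 300 * X["discount_pct"] + 20 * X["display_space"] + rng.normal(0, 10, n)

    # Rank the candidate variables by a univariate F-test against demand.
    selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
    scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
    print(scores)
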
Despite these challenges, two-thirds of companies in the consumer supply chain consider forecast accuracy a high business priority, and 74 percent said it would be helpful to develop a bottom-up forecast based on stock-keeping unit (SKU) by key customer.[4]
Traditional trade promotion forecasting methods
Many companies forecast the impact of trade promotions primarily through a human expert approach. Human experts are unable to take into account all the variables involved and also cannot provide an analytic prediction of campaign behavior and trends. A recent survey by Aberdeen Group showed that 78 percent of companies used Microsoft Excel spreadsheets as their primary trade promotion technology tool. The limitations of relying upon spreadsheets for trade promotion planning and forecasting include lack of visibility, ineffectiveness and difficulty in tracking deductions.[5]
Specialized applications have been developed and become more common. Thirty-five percent of companies now use legacy systems, 30 percent use Sales and Operations Planning (S&OP) applications, 26 percent use integrated Enterprise Resource Planning (ERP) modules and 17 percent use home-grown trade promotion solutions. These applications support the planning process, while still primarily relying on human knowledge and intuition for forecasting. One problem with this approach is that humans tend to make optimistic assumptions when forecasting and planning. The result is that forecasts most commonly err on the optimistic side and that human forecasters also tend to underestimate the amount of uncertainty in their forecasts.[6]
A further issue is that the majority of manufacturers use legacy trade promotion systems that contribute to internal fragmentation of trade marketing data. Many of these companies are currently using assumption-based forecasts with limited accuracy.[7]
Analytic approaches to trade promotion forecasting
TPF is complicated by the fact that campaigns are described by both quantitative (such as price and discount) and qualitative (such as display space and support by sales representatives) variables. New approaches are being developed to address this and other challenges. Most of these approaches attempt to incorporate large amounts of heterogeneous data in the forecasting process. One researcher validated the ability of multivariate regression models to forecast the impact of many variables, including price, discount and visual merchandising, on a product's sales.[8]
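
In the spirit of that regression approach (not a reconstruction of the cited study), here is a sketch in which qualitative promotion attributes are one-hot encoded and combined with quantitative ones in a linear model; all data and column names are synthetic.

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Synthetic promotion history mixing quantitative and qualitative variables.
    rng = np.random.default_rng(2)
    n = 400
    promos = pd.DataFrame({
        "price":        rng.uniform(1.0, 6.0, n),
        "discount_pct": rng.uniform(0.0, 0.5, n),
        "display":      rng.choice(["none", "shelf", "end_cap"], n),
        "rep_support":  rng.choice(["no", "yes"], n),
    })
    uplift = (200 * promos["discount_pct"]
              + 30 * (promos["display"] == "end_cap")
              + 15 * (promos["rep_support"] == "yes")
              + rng.normal(0, 5, n))

    # Encode the qualitative columns and fit a multivariate linear regression.
    model = Pipeline([
        ("encode", ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), ["display", "rep_support"])],
            remainder="passthrough")),
        ("regress", LinearRegression()),
    ])
    model.fit(promos, uplift)

    # Forecast the uplift of a planned campaign scenario.
    planned = pd.DataFrame([{"price": 3.5, "discount_pct": 0.25,
                             "display": "end_cap", "rep_support": "yes"}])
    print(model.predict(planned))
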
The term Big Data describes the increasing volume and velocity of heterogeneous data that is coming into the enterprise. The challenge is to combine this data across all of the silos within the organization for a single view. The data can be used to improve trade promotion forecast accuracy because it usually contains real connections and causation that can help to better understand what customers are buying, where they are buying it, why they are buying and how they are buying.[9]
Traditional methods are insufficient to assimilate and process such a large volume of data. Therefore more sophisticated modeling and algorithms have been developed to address the problem. Some companies have begun using machine learning methods to utilize the massive volumes of unstructured and structured data they already hold to better understand these connections and causality.[10]
Machine learning can make it possible to recognize the shared characteristics of promotional events and identify their effect on normal sales. Learning machines use universal approximations of nonlinear functions to model complex nonlinear phenomena. Learning machines process sets of input and output data and develop a model of their relationship. Based on this model, learning machines forecast outputs associated with new sets of input data.[10]
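
As a generic illustration of this learn-from-input/output-pairs idea, rather than any specific vendor's system, the sketch below fits a gradient-boosted tree model on synthetic promotion data and then forecasts uplift for an unseen promotion plan.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    # Synthetic input/output pairs: promotion features -> observed uplift.
    rng = np.random.default_rng(3)
    n = 600
    X = np.column_stack([
        rng.uniform(0.0, 0.5, n),   # discount depth
        rng.integers(0, 3, n),      # display level
        rng.uniform(1.0, 8.0, n),   # price point
    ])
    # Nonlinear ground truth: uplift saturates at deep discounts.
    y = 100 * np.tanh(4 * X[:, 0]) + 12 * X[:, 1] + rng.normal(0, 4, n)

    # Learn the input/output relationship, then forecast for new inputs.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

    print("holdout R^2:", round(model.score(X_test, y_test), 3))
    print("predicted uplift:", model.predict([[0.30, 2, 3.5]]))
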
Intelligible Machine Learning (IML) is an implementation of Switching Neural Networks that has been applied to TPF. Starting from a collection of promotional characteristics, IML is able to identify and present in intelligible form existing correlations between relevant attributes and uplift. This approach is designed to automatically select the most suitable uplift model in order to describe the future impact of a planned promotion. In addition, new promotions are automatically classified using the previously trained model, thus providing a simple way of studying different what-if scenarios.[11]
TPF systems should be capable of correlating and analyzing vast amounts of raw data in different formats such as corporate sales histories and online data from social media. The analysis should be able to be performed very quickly so planners can respond quickly to demand signals.[12]
Groupe Danone used machine learning technology for trade promotion forecasting of a range of fresh products characterized by dynamic demand and short shelf life. The project increased forecast accuracy to 92 percent, resulting in an improvement in service levels to 98.6 percent, a 30 percent reduction in lost sales and a 30 percent reduction in product obsolescence.[13]
Source: Wikipedia.