Wednesday, 24 July 2019

Big Data Analytics Data Life Cycle


Traditional Data Mining Life Cycle
In order to provide a framework to organize the work needed by an organization and deliver clear insights from Big Data, it’s useful to think of it as a cycle with different stages. It is by no means linear, meaning all the stages are related with each other. This cycle has superficial similarities with the more traditional data mining cycle as described in CRISP methodology.

CRISP-DM Methodology
The CRISP-DM methodology that stands for Cross Industry Standard Process for Data Mining, is a cycle that describes commonly used approaches that data mining experts use to tackle problems in traditional BI data mining. It is still being used in traditional BI data mining teams.

Take a look at the following illustration. It shows the major stages of the cycle as described by the CRISP-DM methodology and how they are interrelated.

Life Cycle
CRISP-DM was conceived in 1996 and the next year, it got underway as a European Union project under the ESPRIT funding initiative. The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA (an insurance company). The project was finally incorporated into SPSS. The methodology is extremely detailed oriented in how a data mining project should be specified.

Let us now learn a little more on each of the stages involved in the CRISP-DM life cycle −

Business Understanding − This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition. A preliminary plan is designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard can be used.

Data Understanding − The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

Data Preparation − The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

Modeling − In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, it is often required to step back to the data preparation phase.

Evaluation − At this stage in the project, you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate the model thoroughly and review the steps executed to construct the model, to be certain it properly achieves the business objectives.

A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Deployment − Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer.

Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process.

In many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand upfront the actions which will need to be carried out in order to actually make use of the created models.

SEMMA Methodology
SEMMA is another methodology developed by SAS for data mining modeling. It stands for Sample, Explore, Modify, Model, and Asses. Here is a brief description of its stages −

Sample - The process starts with data sampling, e.g., selecting the dataset for modeling. The dataset should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.

Explore - This phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the variables, and also abnormalities, with the help of data visualization.

Modify - The Modify phase contains methods to select, create and transform variables in preparation for data modeling.

Model - In the Model phase, the focus is on applying various modeling (data mining) techniques on the prepared variables in order to create models that possibly provide the desired outcome.

Assess - The evaluation of the modeling results shows the reliability and usefulness of the created models.

The main difference between CRISM–DM and SEMMA is that SEMMA focuses on the modeling aspect, whereas CRISP-DM gives more importance to stages of the cycle prior to modeling such as understanding the business problem to be solved, understanding and preprocessing the data to be used as input, for example, machine learning algorithms.

Big Data Life Cycle
In today’s big data context, the previous approaches are either incomplete or suboptimal. For example, the SEMMA methodology disregards completely data collection and preprocessing of different data sources. These stages normally constitute most of the work in a successful big data project.

A big data analytics cycle can be described by the following stage

Business Problem Definition
Research
Human Resources Assessment
Data Acquisition
Data Munging
Data Storage
Exploratory Data Analysis
Data Preparation for Modeling and Assessment
Modeling
Implementation

In this section, we will throw some light on each of these stages of big data life cycle.

Business Problem Definition
This is a point common in traditional BI and big data analytics life cycle. Normally it is a non-trivial stage of a big data project to define the problem and evaluate correctly how much potential gain it may have for an organization. It seems obvious to mention this, but it has to be evaluated what are the expected gains and costs of the project.

Research
Analyze what other companies have done in the same situation. This involves looking for solutions that are reasonable for your company, even though it involves adapting other solutions to the resources and requirements that your company has. In this stage, a methodology for the future stages should be defined.

Human Resources Assessment
Once the problem is defined, it’s reasonable to continue analyzing if the current staff is able to complete the project successfully. Traditional BI teams might not be capable to deliver an optimal solution to all the stages, so it should be considered before starting the project if there is a need to outsource a part of the project or hire more people.

Data Acquisition
This section is key in a big data life cycle; it defines which type of profiles would be needed to deliver the resultant data product. Data gathering is a non-trivial step of the process; it normally involves gathering unstructured data from different sources. To give an example, it could involve writing a crawler to retrieve reviews from a website. This involves dealing with text, perhaps in different languages normally requiring a significant amount of time to be completed.

Data Munging
Once the data is retrieved, for example, from the web, it needs to be stored in an easyto-use format. To continue with the reviews examples, let’s assume the data is retrieved from different sites where each has a different display of the data.

Suppose one data source gives reviews in terms of rating in stars, therefore it is possible to read this as a mapping for the response variable {1, 2, 3, 4, 5}. Another data source gives reviews using two arrows system, one for up voting and the other for down voting. This would imply a response variable of the form y ∈ {positive, negative}.

In order to combine both the data sources, a decision has to be made in order to make these two response representations equivalent. This can involve converting the first data source response representation to the second form, considering one star as negative and five stars as positive. This process often requires a large time allocation to be delivered with good quality.

Data Storage
Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives regarding this point. The most common alternative is using the Hadoop File System for storage that provides users a limited version of SQL, known as HIVE Query Language. This allows most analytics task to be done in similar ways as would be done in traditional BI data warehouses, from the user perspective. Other storage options to be considered are MongoDB, Redis, and SPARK.

This stage of the cycle is related to the human resources knowledge in terms of their abilities to implement different architectures. Modified versions of traditional data warehouses are still being used in large scale applications. For example, teradata and IBM offer SQL databases that can handle terabytes of data; open source solutions such as postgreSQL and MySQL are still being used for large scale applications.

Even though there are differences in how the different storages work in the background, from the client side, most solutions provide a SQL API. Hence having a good understanding of SQL is still a key skill to have for big data analytics.

This stage a priori seems to be the most important topic, in practice, this is not true. It is not even an essential stage. It is possible to implement a big data solution that would be working with real-time data, so in this case, we only need to gather data to develop the model and then implement it in real time. So there would not be a need to formally store the data at all.

Exploratory Data Analysis
Once the data has been cleaned and stored in a way that insights can be retrieved from it, the data exploration phase is mandatory. The objective of this stage is to understand the data, this is normally done with statistical techniques and also plotting the data. This is a good stage to evaluate whether the problem definition makes sense or is feasible.

Data Preparation for Modeling and Assessment
This stage involves reshaping the cleaned data retrieved previously and using statistical preprocessing for missing values imputation, outlier detection, normalization, feature extraction and feature selection.

Modelling
The prior stage should have produced several datasets for training and testing, for example, a predictive model. This stage involves trying different models and looking forward to solving the business problem at hand. In practice, it is normally desired that the model would give some insight into the business. Finally, the best model or combination of models is selected evaluating its performance on a left-out dataset.

Implementation
In this stage, the data product developed is implemented in the data pipeline of the company. This involves setting up a validation scheme while the data product is working, in order to track its performance. For example, in the case of implementing a predictive model, this stage would involve applying the model to new data and once the response is available, evaluate the model.

Tuesday, 10 April 2018

Introduction to Android

What is Android?

Android is an open source and Linux-based Operating System for mobile devices such as smartphones and tablet computers. Android was developed by the Open Handset Alliance, led by Google, and other companies.

Android offers a unified approach to application development for mobile devices which means developers need only develop for Android, and their applications should be able to run on different devices powered by Android.

The first beta version of the Android Software Development Kit (SDK) was released by Google in 2007 where as the first commercial version, Android 1.0, was released in September 2008.

On June 27, 2012, at the Google I/O conference, Google announced the next Android version, 4.1 Jelly Bean. Jelly Bean is an incremental update, with the primary aim of improving the user interface, both in terms of functionality and performance.

The source code for Android is available under free and open source software licenses. Google publishes most of the code under the Apache License version 2.0 and the rest, Linux kernel changes, under the GNU General Public License version 2.

Why Android ?
Features of Android
Android is a powerful operating system competing with Apple 4GS and supports great features. Few of them are listed below:

Feature & Description

1 Beautiful UI
Android OS basic screen provides a beautiful and intuitive user interface.

2 Connectivity
GSM/EDGE, IDEN, CDMA, EV-DO, UMTS, Bluetooth, Wi-Fi, LTE, NFC and WiMAX.

3 Storage
SQLite, a lightweight relational database, is used for data storage purposes.

4 Media support
H.263, H.264, MPEG-4 SP, AMR, AMR-WB, AAC, HE-AAC, AAC 5.1, MP3, MIDI, Ogg Vorbis, WAV, JPEG, PNG, GIF, and BMP.

5 Messaging
SMS and MMS

6 Web browser
Based on the open-source WebKit layout engine, coupled with Chrome's V8 JavaScript engine supporting HTML5 and CSS3.

7 Multi-touch
Android has native support for multi-touch which was initially made available in handsets such as the HTC Hero.

8 Multi-tasking
User can jump from one task to another and same time various application can run simultaneously.

9 Resizable widgets
Widgets are resizable, so users can expand them to show more content or shrink them to save space.

10 Multi-Language
Supports single direction and bi-directional text.

11 GCM
Google Cloud Messaging (GCM) is a service that lets developers send short message data to their users on Android devices, without needing a proprietary sync solution.

12  Wi-Fi Direct
A technology that lets apps discover and pair directly, over a high-bandwidth peer-to-peer connection.

13 Android Beam
A popular NFC-based technology that lets users instantly share, just by touching two NFC-enabled phones together.

Android Applications
Android applications are usually developed in the Java language using the Android Software Development Kit.
Once developed, Android applications can be packaged easily and sold out either through a store such as Google Play, SlideME, Opera Mobile Store, Mobango, F-droid and the Amazon Appstore.

Android powers hundreds of millions of mobile devices in more than 190 countries around the world. It's the largest installed base of any mobile platform and growing fast. Every day more than 1 million new Android devices are activated worldwide.
This tutorial has been written with an aim to teach you how to develop and package Android application. We will start from environment setup for Android application programming and then drill down to look into various aspects of Android applications.

Categories of Android applications
There are many android applications in the market. The top categories are

History of Android
The code names of android ranges from A to N currently, such as Aestro, Blender, Cupcake, Donut, Eclair, Froyo, Gingerbread, Honeycomb, Ice Cream Sandwitch, Jelly Bean, KitKat, Lollipop and Marshmallow. Let's understand the android history in a sequence.

What is API level?
API Level is an integer value that uniquely identifies the framework API revision offered by a version of the Android platform.

Platform Version API Level VERSION_CODE
Android 6.0 23 MARSHMALLOW
Android 5.1 22 LOLLIPOP_MR1
Android 5.0 21 LOLLIPOP
Android 4.4W 20 KITKAT_WATCH KitKat for Wearables Only
Android 4.4 19 KITKAT
Android 4.3 18 JELLY_BEAN_MR2
Android 4.2, 4.2.2 17 JELLY_BEAN_MR1
Android 4.1, 4.1.1 16 JELLY_BEAN
Android 4.0.3, 4.0.4 15 ICE_CREAM_SANDWICH_MR1
Android 4.0, 4.0.1, 4.0.2 14 ICE_CREAM_SANDWICH
Android 3.2 13 HONEYCOMB_MR2
Android 3.1.x 12 HONEYCOMB_MR1
Android 3.0.x 11 HONEYCOMB
Android 2.3.4

Android 2.3.3

10 GINGERBREAD_MR1
Android 2.3.2

Android 2.3.1

Android 2.3

9 GINGERBREAD
Android 2.2.x 8 FROYO
Android 2.1.x 7 ECLAIR_MR1
Android 2.0.1 6 ECLAIR_0_1
Android 2.0 5 ECLAIR
Android 1.6 4 DONUT
Android 1.5 3 CUPCAKE
Android 1.1 2 BASE_1_1
Android 1.0 1 BASE

Guide to R programming language

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with the procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.

R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.

R is free software distributed under a GNU-style copy left, and an official part of the GNU project called GNU S.

Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.

A large group of individuals has contributed to R by sending code and bug reports.

Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.

Features of R
As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R −

R is a well-developed, simple and effective programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.

R has an effective data handling and storage facility,

R provides a suite of operators for calculations on arrays, lists, vectors and matrices.

R provides a large, coherent and integrated collection of tools for data analysis.

R provides graphical facilities for data analysis and display either directly at the computer or printing at the papers.

As a conclusion, R is world’s most widely used statistics programming language. It's the # 1 choice of data scientists and supported by a vibrant and talented community of contributors. R is taught in universities and deployed in mission critical business applications. This tutorial will teach you R programming along with suitable examples in simple and easy steps.

Local Environment Setup
If you are still willing to set up your environment for R, you can follow the steps given below.

Windows Installation
You can download the Windows installer version of R from R-3.2.2 for Windows (32/64 bit) and save it in a local directory.

As it is a Windows installer (.exe) with a name "R-version-win.exe". You can just double click and run the installer accepting the default settings. If your Windows is 32-bit version, it installs the 32-bit version. But if your windows is 64-bit, then it installs both the 32-bit and 64-bit versions.

After installation you can locate the icon to run the Program in a directory structure "R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files. Clicking this icon brings up the R-GUI which is the R console to do R Programming.

Linux Installation
R is available as a binary for many versions of Linux at the location R Binaries.

The instruction to install Linux varies from flavor to flavor. These steps are mentioned under each type of Linux version in the mentioned link. However, if you are in a hurry, then you can use yum command to install R as follows --

$ yum install R
Above command will install core functionality of R programming along with standard packages, still you need additional package, then you can launch R prompt as follows −

$ R

R version 3.2.0 (2015-04-16) -- "Full of  Ingredients"         
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
       
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
           
R is a collaborative project with many  contributors.                   
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
     
Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


Now you can use install command at R prompt to install the required package. For example, the following command will install plotrix package which is required for 3D charts.

> install.packages("plotrix")

Monday, 9 April 2018

Introduction to web socket programming

Web sockets are defined as a two-way communication between the servers and the clients, which mean both the parties, communicate and exchange data at the same time. This protocol defines a full duplex communication from the ground up. Web sockets take a step forward in bringing desktop rich functionalities to the web browsers. It represents an evolution, which was awaited for a long time in client/server web technology.

This tutorial has been prepared for anyone who has a basic knowledge of Protocols and understanding of HTTP. After completing this tutorial, you will find yourself at a moderate level of expertise in understanding what makes Web Sockets different from the traditional HTTP request/response pattern.

Before you start proceeding with this tutorial, we are assuming that you are already aware about the basics of JavaScript and understanding of the HTTP protocol. If you are not well aware of these concepts, then we will suggest you to go through our short tutorials on JavaScript and HTTP.

In literal terms, handshaking can be defined as gripping and shaking of right hands by two individuals, as to symbolize greeting, congratulations, agreement or farewell. In computer science, handshaking is a process that ensures the server is in sync with its clients. Handshaking is the basic concept of Web Socket protocol.

Web Sockets  Definition
Web sockets are defined as a two-way communication between the servers and the clients, which mean both the parties communicate and exchange data at the same time.
The key points of Web Sockets are true concurrency and optimization of performance, resulting in more responsive and rich web applications.

Description of Web Socket Protocol
This protocol defines a full duplex communication from the ground up. Web sockets take a step forward in bringing desktop rich functionalities to the web browsers. It represents an evolution, which was awaited for a long time in client/server web technology.

The main features of web sockets are as follows :-

Web socket protocol is being standardized, which means real time communication between web servers and clients is possible with the help of this protocol.

Web sockets are transforming to cross platform standard for real time communication between a client and the server.

This standard enables new kind of the applications. Businesses for real time web application can speed up with the help of this technology.

The biggest advantage of Web Socket is it provides a two-way communication (full duplex) over a single TCP connection.

URL
HTTP has its own set of schemas such as http and https. Web socket protocol also has similar schema defined in its URL pattern.

Protocol
Browser Support
The latest specification of Web Socket protocol is defined as RFC 6455 – a proposed standard.
RFC 6455 is supported by various browsers like Internet Explorer, Mozilla Firefox, Google Chrome, Safari, and Opera.

Web Sockets Implementation

Web Sockets occupy a key role not only in the web but also in the mobile industry. The importance of Web Sockets is given below.

Web Sockets as the name indicates, are related to the web. Web consists of a bunch of techniques for some browsers; it is a broad communication platform for vast number of devices, including desktop computers, laptops, tablets and smart phones.

HTML5 app that utilizes Web Sockets will work on any HTML5 enabled web browser.

Web socket is supported in the mainstream operating systems. All key players in the mobile industry provide Web Socket APIs in own native apps.

Web sockets are said to be a full duplex communication. The approach of Web Sockets works well for certain categories of web application such as chat room, where the updates from client as well as server are shared simultaneously.

Web Socket
Web Sockets, a part of the HTML5 specification, allow full duplex communication between
web pages and a remote host. The protocol is designed to achieve the following benefits, which can be considered as the key points:-

Reduce unnecessary network traffic and latency using full duplex through a single connection (instead of two).

Streaming through proxies and firewalls, with the support of upstream and downstream communication simultaneously.

Web
Web Socket connections are initiated via HTTP; HTTP servers typically interpret Web Socket handshakes as an Upgrade request.

Web Sockets can both be a complementary add-on to an existing HTTP environment and can provide the required infrastructure to add web functionality. It relies on more advanced, full duplex protocols that allow data to flow in both directions between client and server.

Functions of Web Sockets
Web Sockets provide a connection between the web server and a client such that both the parties can start sending the data.

The steps for establishing the connection of Web Socket are as follows.

The client establishes a connection through a process known as Web Socket handshake.

The process begins with the client sending a regular HTTP request to the server.

An Upgrade header is requested. In this request, it informs the server that request is
for Web Socket connection.

Web Socket URLs use the ws scheme. They are also used for secure Web Socket connections,
which are the equivalent to HTTPs.

Saturday, 7 April 2018

web service security


Security is critical to web services. However, neither XML-RPC nor SOAP specifications make any
explicit security or authentication requirements.

There are three specific security issues with web services −

Confidentiality
Authentication
Network Security
Confidentiality
If a client sends an XML request to a server, can we ensure that the communication remains confidential?

Answer lies here

XML-RPC and SOAP run primarily on top of HTTP.
HTTP has support for Secure Sockets Layer (SSL).
Communication can be encrypted via SSL.
SSL is a proven technology and widely deployed.
A single web service may consist of a chain of applications. For example, one large service might tie
together the services of three other applications. In this case, SSL is not adequate; the messages
need to be encrypted at each node along the service path, and each node represents a potential weak
 link in the chain. Currently, there is no agreed-upon solution to this issue, but one promising
solution is the W3C XML Encryption Standard. This standard provides a framework for encrypting and
decrypting entire XML documents or just portions of an XML document. You can check it at
https://www.w3.org/Encryption

Authentication
If a client connects to a web service, how do we identify the user? Is the user authorized to use the service?

The following options can be considered but there is no clear consensus on a strong authentication scheme.

HTTP includes built-in support for Basic and Digest authentication, and services can therefore be protected in much the same manner as HTML documents are currently protected.

SOAP Digital Signature (SOAP-DSIG) leverages public key cryptography to digitally sign SOAP messages.
It enables the client or server to validate the identity of the other party. Check it at https://www.w3.org/TR/SOAP-dsig.

The Organization for the Advancement of Structured Information Standards (OASIS) is working on
the Security Assertion Markup Language (SAML).

Network Security
There is currently no easy answer to this problem, and it has been the subject of much debate.
For now, if you are truly intent on filtering out SOAP or XML-RPC messages, one possibility is to filter out all HTTP POST requests that set their content type to text/xml.

Another alternative is to filter the SOAPAction HTTP header attribute. Firewall vendors are also
currently developing tools explicitly designed to filter web service traffic.

Friday, 6 April 2018

Why Web Services

Here are the benefits of using Web Services:

Exposing the Existing Function on the network
A web service is a unit of managed code that can be remotely invoked using HTTP, that is, it can be activated using HTTP requests. Web services allows you to expose the functionality of your existing code over the network. Once it is exposed on the network, other application can use the functionality of your program.

Interoperability
Web services allow various applications to talk to each other and share data and services among themselves. Other applications can also use the web services. For example, a VB or .NET application can talk to Java web services and vice versa. Web services are used to make the application platform and technology independent.

Standardized Protocol
Web services use standardized industry standard protocol for the communication. All the four layers (Service Transport, XML Messaging, Service Description, and Service Discovery layers) use well-defined protocols in the web services protocol stack. This standardization of protocol stack gives the business many advantages such as a wide range of choices, reduction in the cost due to competition, and increase in the quality.

Low Cost Communication
Web services use SOAP over HTTP protocol, so you can use your existing low-cost internet for
implementing web services. This solution is much less costly compared to proprietary solutions
like EDI/B2B. Besides SOAP over HTTP, web services can also be implemented on other reliable
transport mechanisms like FTP.