Friday, November 29, 2019

Why is the NumPy Python library an important package?

The following are some of the top reasons why learning about NumPy will help you going forward:

● Operation speed

You might not know this, but the core of NumPy is written in one of the oldest programming languages, C. One of the properties you benefit from is that it can execute faster than pure Python code. This makes sense when you consider that Python as a whole is a dynamic language that needs interpretation: Python code is first compiled to bytecode, which the interpreter then executes. Compiled C code will almost always perform faster than the average Python code.

There are specific Python versions that are faster than others. For example, programs written in Python 2 have tended to run slightly faster than those written in Python 3, with a difference of roughly 5 to 14%, so most people will never notice the gap unless they are looking for it. NumPy arrays are stored in contiguous blocks of memory, with every element of the same type and size. Because of this, they are easier to access and process where necessary. Python, on the other hand, uses lists for most tasks, and a single list can contain objects of different types. As a result, looping over a Python list is relatively slower than the equivalent C loop, which is why NumPy is such a fast package.
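
If you want to see the difference yourself, a quick and informal way is to time the same element-wise operation on a plain Python list and on a NumPy array. The snippet below is only a sketch; the exact timings depend on your machine, but the NumPy version is usually many times faster.

import timeit

setup = "import numpy as np; data = list(range(100000)); arr = np.arange(100000)"
list_time = timeit.timeit("[x * 2 for x in data]", setup=setup, number=100)
numpy_time = timeit.timeit("arr * 2", setup=setup, number=100)
print("Python list:", list_time)
print("NumPy array:", numpy_time)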

● Support for other libraries

One of the reasons why NumPy is an important library to learn is that it underpins most of the scientific Python libraries. Through NumPy, you can use Pandas, SciPy, SymPy, and many others. In fact, SciPy and NumPy pretty much work hand in hand.

In NumPy, you can also perform many linear algebra operations. This is an important part of data analysis, and one that SciPy also builds on. Most of the time, you will install NumPy and SciPy together to enhance your performance in data analysis or scientific computing.

● Matrix computations

Through the ndarray object and the functions built around it, you can perform many computations involving matrices in NumPy, including raising matrices to specific powers and computing the product of two matrices.

A lot of the work required in data analysis involves algebraic equations and computations. Performing these in NumPy makes your work easier and enhances your ability to deliver the best outcome.
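
As a small illustration of the matrix work described above, the sketch below uses made-up matrices to compute a matrix product, raise a matrix to a power, and solve a small linear system.

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(np.dot(A, B))                  # matrix product of A and B
print(np.linalg.matrix_power(A, 3))  # A raised to the third power
print(np.linalg.solve(A, [5, 11]))   # solve the system A x = [5, 11]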

● Functional package

If there is one reason why using NumPy is a good idea for you, it is the fact that it supports many functions. Most of the routines you need for numerical work are already built into NumPy, so you don’t need to install them separately.

From math computations, to linear algebra, indices, random samples, statistics and polynomials, you will never run out of supporting options when working in NumPy. This further enhances your ability to analyze different types of data and draw conclusive remarks from them.

● Universal support

NumPy uses universal functions, referred to as ufuncs. These are functions that apply element by element to an array input, producing an output array of the same shape as the input.

Beyond this, you will also find the array broadcasting feature coming in handy, especially when working with arrays of different shapes. Arrays come in different sizes and shapes, and broadcasting lets them be used together within the same operation.
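
Here is a minimal sketch of both ideas, using made-up arrays: np.sqrt is a ufunc applied element-wise, and the additions rely on broadcasting.

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
b = np.array([10, 20, 30])             # shape (3,)

print(np.sqrt(a))   # ufunc: applied to every element of a
print(a + b)        # b is broadcast across each row of a
print(a * 2)        # a scalar is broadcast across the whole array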

Thanks to broadcasting, NumPy automatically stretches the smaller array so that its shape is compatible with the larger one in your expression. NumPy is one of the first Python libraries you should master. Knowledge of NumPy will help you advance into other libraries like SciPy, which are equally important, and will form a great part of your data analysis journey.

Share:

Thursday, November 28, 2019

NumPy - Slicing and Indexing

By now I expect that you are familiar with slicing standard Python lists. The same knowledge applies when slicing one-dimensional NumPy arrays. You will also learn how to flatten arrays. Flattening arrays simply means converting a multidimensional array into a one-dimensional array.
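
As a quick refresher, here is a minimal sketch of one-dimensional slicing; it behaves just like list slicing.

import numpy as np

a = np.arange(9)   # array([0, 1, 2, 3, 4, 5, 6, 7, 8])
print(a[3:7])      # elements at indices 3 to 6
print(a[:5])       # the first five elements
print(a[::2])      # every second element
print(a[::-1])     # the whole array reversed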

The ravel() function can manipulate the shape of an array as follows:

Input
b
Output
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]]])

Input
b.ravel()

Output
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

The flatten() function performs the same task as ravel(). The difference is that flatten() always allocates new memory and returns a copy, whereas ravel() returns a view of the original data whenever possible.
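
The difference matters when you modify the result. Here is a small sketch, assuming a contiguous array so that ravel() can return a view:

import numpy as np

m = np.arange(8).reshape(2, 4)
r = m.ravel()     # a view of m's data where possible
f = m.flatten()   # always a freshly allocated copy

r[0] = 99         # changing the view also changes m
print(m[0, 0])    # 99
f[1] = 77         # changing the copy leaves m untouched
print(m[0, 1])    # 1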

It is possible to set the shape of an array directly with a tuple, without using the reshape() function. This is done as follows:

Input
b.shape = (3,4)

Input
b

Output
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Transposition is a common procedure in linear algebra where you convert the rows into columns and columns into rows. Using the example above, we will have the following output:

Input
b.transpose()

Output:
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

It is possible to stack arrays depth-wise, vertically, or horizontally. For this purpose, you will use the following functions:

● hstack()
● dstack()
● vstack()

To stack two arrays a and b horizontally, pass them to hstack() as a tuple:

Input:
hstack((a, b))

To stack them vertically, use vstack() in the same way:

Input:
vstack((a, b))

To stack them depth-wise, use dstack():

Input:
dstack((a, b))
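
Putting the three functions together, here is a small self-contained sketch; the arrays a and b are made up for illustration.

import numpy as np

a = np.arange(4).reshape(2, 2)
b = a * 2

print(np.hstack((a, b)))   # shape (2, 4): columns of b appended to a
print(np.vstack((a, b)))   # shape (4, 2): rows of b appended below a
print(np.dstack((a, b)))   # shape (2, 2, 2): stacked along a third axis
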
Share:

Wednesday, November 27, 2019

NumPy - Generating Arrays

There are different ways of creating arrays. The examples in the previous post illustrate the simplest: passing a sequence, such as a list, as an argument to the array() function. Below is an example:

>>> x = np.array([[5, 7, 9],[6, 8, 10]])
>>> x
array([[5, 7, 9],[6, 8, 10]])

Other than lists, you can also pass one or more tuples to the array() function in the same manner, as shown below:

>>> x = np.array(((5, 7, 9),(6, 8, 10)))
>>> x
array([[5, 7, 9], [6, 8, 10]])

Alternatively, you can mix tuples and lists in the same call, as shown below:

>>> x = np.array([(1, 4, 9), [2, 4, 6], (3, 6, 9)])
>>> x
array([[1, 4, 9],[2, 4, 6],[3, 6, 9]])

As you work with ndarrays, you will come across different types of data. Generally, you will be dealing with numerical values a lot, especially float and integer values. However, the NumPy library is built to support more than those two. The following are other data types that you will use in NumPy:

● bool_
● int_
● intc, intp, int8, int16
● uint8, uint16, uint32, uint64
● float_, float16, float32, float64
● complex64, complex128

Each of the NumPy numerical types mentioned above has a corresponding conversion function that creates a value of that type, as shown below:

Input
float64(52)

Output
52.0

Input
int8(52.0)

Output
52

Input
bool(52)

Output
True

Input
bool(0)

Output
False

Input
bool(52.0)

Output
True

Input
float(True)

Output
1.0

Input
float(False)

Output
0.0

Some functions accept a data type as part of the argument, as shown below:

Input:
arange(6, dtype=uint16)

Output:
array([0, 1, 2, 3, 4, 5], dtype=uint16)

Before you create a multidimensional array, you must know how to create a vector as shown below:

a = arange(4)
a.dtype

Output
dtype('int64')
a
Output
array([0, 1, 2, 3])
a.shape
Output
(4,)

The vector outlined above has only four components. The value of the components is between 0 and 3. To create a multidimensional array, you must know the shape of the array as shown below:

x = array([arange(2), arange(2)])
x
Output
array([[0, 1],[0, 1]])

To determine the shape of the array, use the shape attribute:

x.shape

Output
(2, 2)

The arange() function has been used to build a 2 x 2 array. You will come across situations where you need to select only some elements of an array and ignore the rest. Before you begin, create a 2 x 2 matrix as shown below:

a = array([[10,20],[30,40]])
a

Output

array([[10, 20],
[30, 40]])

From the array above, we are going to select an item. Keep in mind that the index numbers in NumPy always start from 0.

Input: a[0, 0]

Output:
10

Input: a[0, 1]

Output:
20

Input: a[1, 0]

Output:
30

Input: a[1, 1]

Output:
40

From the example above, you can see how easy it is to select specific elements from an array. Given an array a, as above, we use the notation a[x, y], where x and y are the row and column indices of the element within the array a. From time to time you might come across character codes. It is important to know the data types associated with them:

Character code    Data type

  • b    bool
  • d    double-precision float
  • D    complex
  • f    single-precision float
  • i    integer
  • S    string
  • u    unsigned integer
  • U    unicode
  • V    void

For example, an array of single-precision floats can be created as shown below:

Input:
arange(5, dtype='f')

Output:
array([0., 1., 2., 3., 4.], dtype=float32)
Share:

Tuesday, November 26, 2019

NumPy - Statistics in Python

For data analysis, your understanding of NumPy will help in scientific computation. Knowledge of this library is a fundamental step in data analysis mastery. Once you understand NumPy, you can then build on to other libraries like Pandas.

Once you learn the basics of NumPy, you can then advance into data analytics, using linear algebra and other statistical approaches to analyze data. These are two of the most important mathematical aspects that any data analyst should know about. During data analysis, you will often be required to make predictions based on some raw data at your disposal. For example, you might be asked to present the standard deviation or arithmetic mean of some data for analysis.

In linear algebra, the emphasis is on using linear equations to solve problems through NumPy and SciPy. Mastery of the NumPy basics will help you build on the knowledge you have gained over the years, and perform complex operations in Python.

In NumPy, one of the things you should remember is file I/O. All the data you access is retrieved from files. Therefore, it is important that you learn the basic read and write operations to the said files. One of the benefits of using the NumPy library is that you are always aware that all the items contained in any array share the same type. Because of this reason, you can easily determine the size of storage needed for the array.
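
As a minimal sketch of that read and write workflow (the file name here is just an example), you can save an array to a text file and load it back:

import numpy as np

data = np.array([[1.5, 2.0], [3.5, 4.0]])
np.savetxt("data.txt", data)       # write the array to a plain text file
loaded = np.loadtxt("data.txt")    # read it back into an ndarray
print(loaded)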

Once you have it installed, import the NumPy package into a new Python session as follows:

import numpy as np

As you work on NumPy, you will realize that most of the work you do is built around the N-dimensional array, commonly identified as ndarray. The ndarray is a multidimensional array that can hold as many items as defined.

The ndarray is also homogeneous, meaning that all the items present in the array are of the same size and type. Each array is described by a data type object (dtype), and each ndarray is always linked with exactly one dtype.

Each array holds a given number of items, laid out along one or more dimensions. The dimensions and the items within the array define the shape of the array. The dimensions are referred to as axes, and the number of axes is the rank of the array.

When starting a new array, use the array() function to introduce all the elements in a Python list as shown below:

>>> x = np.array([5, 7, 9])
>>> x
array([5, 7, 9])

To determine whether the object you just created is indeed an ndarray, you can introduce the type() function as shown below:

>>> type(x)
<type 'numpy.ndarray'>

A dtype is associated with every ndarray. To identify this data type, use the dtype attribute as shown below:

>>> x.dtype
dtype('int32')

The array above only has one axis. As a result, its rank is 1. The shape of the array above is (3,). How do you determine these values from the array? We introduce the ndim attribute to give us the number of axes, the size attribute to tell us the total number of elements, and finally the shape attribute to determine the shape of the array, as shown below:

>>> x.ndim
1
>>> x.size
3
>>> x.shape
(3L,)

In the examples we have extrapolated above, we have been working with an array in one dimension. As you proceed in data analysis, you will come across arrays that have more than one dimension. Let’s use an example where you have two dimensions below to explain this further.

>>> y = np.array([[12.3, 22.4],[20.3, 24.1]])
>>> y.dtype
dtype('float64')
>>> y.ndim
2
>>> y.size
4
>>> y.shape
(2L, 2L)

This array contains two axes, hence its rank is 2. The length of each of the axes is 2. The itemsize attribute is commonly used in arrays to tell us the size of every item within the array in bytes as shown in the example below:

>>> y.itemsize
8
>>> y.data
<read-write buffer for 0x0000000003D44DF0, size 32, offset 0 at 0x0000000003D5FEA0>

In the next post we'll learn about the different ways of creating arrays.
Share:

Monday, November 25, 2019

Python Libraries for Data Analysis

The standout reason why Python is so popular is its large endowment of libraries. Each library is unique, yet extensive enough to enable programmers to solve many data problems every day. The following are some of the top libraries used in data science:

● NumPy

For numerical computations, you need Numerical Python (NumPy). NumPy is considered the foundation of numerical computation in Python. It is a general-purpose array-processing package built around N-dimensional array objects.

NumPy is an efficient library because it provides operators and functions that work directly on whole multidimensional arrays, thereby eliminating the slowness of looping over individual elements during numerical computations.

NumPy functions are precompiled, helping you complete numerical routines faster than with pure Python code. Through NumPy’s approach, you can perform computations quickly and efficiently, especially when the work is vectorized. NumPy is a mainstay in data analysis when you need powerful N-dimensional arrays. Libraries like Scikit-learn and SciPy have NumPy as their foundation, and you can also use NumPy in place of MATLAB if you are working with Matplotlib and SciPy.

● TensorFlow

If you are working on a high-performance computation project, TensorFlow is your best bet. There are thousands of contributors working on this library, which is a good resource pool whenever you are struggling with something.

Through TensorFlow, data scientists are able to define and run computations with tensors. A tensor is a computational object that can be manipulated to derive values. In this library, you can expect high-quality graphical visualizations, which makes it easier for you to present projects to an audience.

In neural machine learning, TensorFlow is preferred by developers because it helps them reduce errors in computations by up to 60%. This further allows them to perform parallel computing. Through parallel computing, developers can then build complex projects and execute them in a fairly simple manner.

Another benefit of using the TensorFlow library is that it enjoys support from Google. This partnership comes in handy especially in library management, as the tech giant allows a seamless support framework when using the library.

Besides that, you will always have some of the latest features when using TensorFlow because the development team behind it releases updates frequently, and you can install them faster than with most libraries.

Given all these benefits, you will find TensorFlow coming in handy when working on video detection projects, time series analysis, text applications, and image or speech recognition projects.

● Matplotlib

For data visualization, Matplotlib provides some of the most impressive results in data science. It is by far the best plotting library you will use in Python, offering a wide range of plots and graphs. To extend its utility further, Matplotlib also comes with an object-oriented API through which you can embed the visualizations you create into different applications.
If you have been working with MATLAB in the past, Matplotlib is a good alternative. Because it is an open-source library, usage is free, and you have access to a large pool of experts who can assist you in many ways. When using Matplotlib, you are not restricted in terms of the operating system.

You can work with lots of output types and backends, allowing you to create visualizations in any format you desire. Perhaps one of the best things about Matplotlib is its runtime behavior: it is light on memory consumption compared to other libraries, so you can expect a smooth experience at runtime, too.

Matplotlib visualizations are useful when analyzing the correlation between different variables. It presents each variable in a unique way, making it easier to spot the similarities and differences between them. You can also use it to detect outliers in a scatter plot or identify uniqueness in data distribution, helping you get a better insight into the data you are studying.
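
For example, a scatter plot like the one sketched below, drawn from invented data, is often enough to spot a correlation or an outlier at a glance.

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(50)           # invented data, for illustration only
y = 2 * x + np.random.rand(50)   # a roughly linear relationship with noise

plt.scatter(x, y)                # scatter plot to inspect the correlation
plt.xlabel("variable x")
plt.ylabel("variable y")
plt.title("Correlation between two variables")
plt.show()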

● Pandas

Python Data Analysis (Pandas) is another important library that you cannot miss in data science. Together with Matplotlib and NumPy, this library comes in handy, especially for cleaning data. Data structures in Pandas are flexible and efficient, allowing you an intuitive and easy way to program structured data.

Concerning the need to clean or wrangle data, Pandas comes second to none. Many data analysts store data in CSV files and other database files. Pandas has exceptional support especially for CSV files, allowing you to access data frames and perform transformations like extract, transform, and load on the data sets in question.

The Pandas syntax is elaborate with incredible functions to enable you to produce amazing results even if your data set is missing some fragments of data. Through Pandas, you can build unique functions and test them on different sets of data.

Pandas helps data scientists in many commercial, financial, and academic fields, especially when dealing with statistical data analysis. It is also a good library for financial computations and has recently been introduced into neuroscience.
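
A minimal cleaning sketch with Pandas is shown below; the file name and the "amount" column are hypothetical and stand in for whatever your data set contains.

import pandas as pd

df = pd.read_csv("sales.csv")        # hypothetical CSV file
df = df.drop_duplicates()            # remove duplicate rows
df = df.dropna(subset=["amount"])    # drop rows missing the (assumed) "amount" column
print(df.describe())                 # quick statistical summary of the numeric columns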

● SciPy

For high-level computations in data science, you need Scientific Python (SciPy). It is an open-source library with thousands of members in its contributor community. SciPy is an extension of NumPy, so you can expect the same efficiency you get from NumPy when you are working on technical and scientific computations. It makes scientific calculation more user-friendly because its functions and algorithms build on NumPy.

You will find SciPy convenient when working on differential equations because the relevant solvers are built into the library. This, coupled with the ndimage submodule, helps in processing multidimensional images faster. The high speed is another reason why SciPy is a reliable library for data visualization and manipulation.

Where is SciPy applicable? As a data scientist, you will need SciPy if your work involves linear algebra, optimization algorithms, Fourier transforms, differential equations, or operations on multidimensional images.
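
Below are two tiny sketches of the kind of problem SciPy handles, with invented inputs: a one-dimensional optimization and a small linear system.

import numpy as np
from scipy import linalg, optimize

# Minimize a simple one-dimensional function, f(x) = (x - 2)^2 + 1
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)
print(result.x)   # approximately 2.0

# Solve the linear system A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(linalg.solve(A, b))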

These are the main libraries you will use for data analysis. If you are using pip, you can install them with the following commands:
pip install numpy
pip install scipy
pip install matplotlib
pip install ipython
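
Once the installations finish, a quick way to confirm that the libraries are available is to import them and print their versions:

import numpy
import scipy
import matplotlib

print(numpy.__version__)
print(scipy.__version__)
print(matplotlib.__version__)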

In the coming posts we'll start exploring these libraries starting with NumPy.
Share:

Saturday, November 23, 2019

Shortcomings of Analyzing Data in Python

Given all the buzz about data and data analysis, it might come as a surprise to a lot of people, but data analysis does have unique challenges that can impede the expected deliverables. One of the biggest challenges that data analysts have to work through is the fact that most of the data they rely on is user-level data.

Because of this reason, there is room for a lot of errors which eventually affects the credibility of the data and reports obtained therefrom. Whether in marketing or any other department in the business that relies on data, the unpredictability of user-level data means some data will be relevant to some
applications and projects, but not all the time. This brings about the challenge of using and discarding data, or alternatively keeping the data and updating it over time.

While Python offers these benefits, it is also important to be aware of some of the challenges and limitations you might experience when programming in Python. This way, you know what you are getting into, and more importantly, you come prepared. Below we will discuss some of the challenges that arise for data analysts when they have to work with this kind of data.

● Input bias

One of a data analyst’s biggest concerns revolves around the reliability of the data at their disposal. Most of the data they have access to, especially from data collection points like the company’s online ads, is not 100% reliable. Contrary to what is expected, this input does not usually present the true picture of the interaction between customers and the brand. Today there are several ways data analysts can try to obtain credible and accurate information about customers. One of these is through cookies that can be tracked. Cookies might provide some data, but the accuracy of that data will always come under scrutiny.

Think about a common scenario where you have different devices, each of which you can use to go online and check some information about your favorite brand. From this example, it is not easy to determine the point at which the sale was made. All the devices belong to you, but you could have made a purchase decision from one of them, but used another to proceed with the purchase. This
level of fragmentation makes it difficult to effectively track customer data. It is likely that data obtained from customers who own different devices will not be accurate. Because of this reason, there is always the risk of using inaccurate data.

● Speed

By now, it is no secret that Python is relatively slow compared to the majority of programming languages. Take C++ for example, which executes code faster than Python. Because of this, you might need to supplement the speed of your applications. Many developers introduce a custom runtime for their applications that is more efficient than the conventional Python runtime.

In data analysis, speed is something that you cannot take for granted, especially if you are working with a lot of time-sensitive data. Awareness of the speed challenges you might encounter in Python programming should help you plan your work accordingly, and set realistic deliverables.

● Version compatibility

One seemingly mundane challenge you will experience in Python is version compatibility. Many programmers consider this a minor issue, but the ramifications are extensive. For beginner data analysts, one of the challenges is settling on the right Python version to learn. It is not an easy choice, especially when you know a newer version is already available.

By default, programmers consider Python 2 the base version. If you want your data analysis work to be future-proof, Python 3 is your best bet. Generally, you will receive updates to either of the versions whenever they are available. However, when it comes to computations and executing code, some challenges might arise. A lot of programmers and data analysts still prefer the second version over the current one, because some of the common libraries and framework packages only support the second version.

● Porting applications

For a high-level programming language, you must use interpreters to help you convert written code into instructions that the operating system can understand. To do this, you will often need to install the correct interpreter version on your computer to handle the applications. This can become a problem when you are unable to port an application to a different platform.

Even if you do, the porting process hardly ever goes smoothly.

● Lack of independence

For all the good that can be done with Python, it is not an independent programming language. Python depends on third-party libraries, packages, and frameworks to enable you to analyze data accordingly. Other programming languages that are available in the market today come with most of the features bundled in already, unlike Python. Any programmer interested in analyzing data in Python must make peace with the fact that they will have to use additional libraries and frameworks. This comes with unique challenges, because the only way out is to bring in open-source dependencies.

Without that, legacy dependencies would consume a lot of resources, increasing the cost of the analysis project.

● Algorithm-based analysis

There are two acceptable methods that data analysts use to study and interpret a given set of data. The first method is to analyze a sample population and draw conclusions about the whole population from the assessment of that sample. Given that the approach covers only a sample, it is possible that the data is not a true representation of the greater population. Samples can easily be biased, which makes it difficult to get the true version of events.

The second approach is to use algorithms to understand information about the population. Running algorithms is a better method because you can study an entire population based on the algorithm syntax. However, algorithms do not always provide the most important contextual answers. Either of the methods above will easily present actionable recommendations. However, they cannot give you answers to why customers behave a certain way.

Data without this contextual approach can be unreliable because it could mean any of a number of possibilities. For the average user, reports from algorithms will hardly answer their most pressing questions.

● Runtime errors

One of the reasons why Python is popular is because of its dynamism. This is a language that is as close to normal human syntax as possible. As a result, you will not necessarily have to define a variable before you call it in your code. You will, therefore, write code without struggling as you would in other languages like C#.

However, even as you enjoy easy coding in Python, you might come across some errors when running your code. The problem arises because Python does not have stringent rules for defining variables. With this in mind, you must run a series of tests whenever you are coding to identify errors and fix them at runtime. This is a process that can cost you a lot of time and resources.

● Outlier risks

In data analysis, you will come across outliers from time to time. Outliers will have you questioning the credibility of your data, especially when you are using raw user data. If a single outlier can cast doubt on the viability of the dataset, imagine the effect of several outliers.

More often you will come across instances where you have weird outliers. It is not easy to interpret them. For example, you might be looking at data about your website, only to realize that for some reason, there was a spike in views during a two-hour period, which cannot be explained. If something like this happens, how can you tell whether the spike represents individual hits on your website or
whether it was just one user whose system was perhaps hacked, or experienced an unprecedented traffic spike?

Certainly, such data will affect your results if you use them in your analysis. The important question is, how do you incorporate this data into your analysis? How can you tell the cause behind the spike in traffic? Is this a random event, or is it a trend you have observed over time? You might have to clean the data to ensure you capture correct results. However, what if by cleaning the data, you assume that the spike was an erroneous outlier, when in real sense the data you are ignoring was legitimate?

● Data transfer restrictions

Data is at the heart of everything today. For this reason, it is imperative that companies do all they can to protect the data at their disposal. If you factor in the stringent data protection laws like the GDPR, protecting databases is a core concern in any organization.

Data analysts might need to share data or discuss some data with their peers, but doing so is often not possible. Access to specific data must be protected, so it is often impossible to share data across servers or from one device to another. If you delve further into big data, most organizations do not have employees with the prerequisite skills to handle such data efficiently. As a result, data administrators must restrict the number of people who can interact with such data.

In light of these restrictions, most of the work done in analysis and recommendations thereof is the prerogative of the data analyst. There is hardly ever a second opinion because very few people have access to the database or data with similar rights. This also creates a problem where users or members of the organization are unable to provide a follow-up opinion on the data. They do not know the procedures or assumptions the analyst used to arrive at their conclusions. In the long run, data can only be validated by one or a few people in the organization who took part in the analysis. This kills the collaborative approach where it should have been allowed to thrive.


Python has enjoyed an amazing library support. In the next post we shall discuss about Python Libraries for Data Analysis.
Share:

Friday, November 22, 2019

Data Analysis in Python over Excel

Most analysts start with Excel and then advance to Python and other languages. In the business world, Microsoft Excel is one of the most important programs, especially when it comes to collecting data. You can use it for data analysis, but there are challenges you might experience, which necessitates the move to Python programming for data analysis.

While Excel is a great tool, it has some unique challenges that you can overcome by learning Python. A bit of Python programming could really change your life and make data analysis easier for you in data science.

● Expert data handling

One of the first things you will enjoy in Python that sets it apart from Excel and other basic data analysis tools is the administrative privileges you enjoy when handling data. This is everything from importing data to manipulation.

You can load almost any data file in Python, something that you cannot always do in Excel. There are some data formats that you generally cannot read or work with functionally in Excel, which impedes your ability to go about your work. This becomes a problem in many situations. You can also come across data files that Excel cannot read but that Python can still work with. Python generally allows you more control over data handling. Therefore, you can easily scrape data from different databases and proceed to analyze it and draw conclusions.

Granted that you can still perform a lot of tasks on your data in Excel, you might have some restrictions. These are not there in Python. You can carry out all manner of manipulation on the data you use. Think about recording, merging, and even cleaning data. Through Python libraries like Pandas, you can view and clean some data to ensure it is suitable for the purpose you intended the analysis.

To do this in Excel, you would have to spend more time than necessary, and probably never get it done properly. Therefore, other than the value in terms of utility, Python also offers you the benefit of time consciousness.

● Automated data management

Excel is an awesome program. Microsoft has spent years developing Excel into an amazing tool for data management. This we can see in the GUI. It is an easy tool for anyone to use, especially someone who lacks programming knowledge.

However, in data analysis, you need to go beyond the ordinary if you are to get the best results. More often Excel will be useful up until the moment you need to automate some processes. This is where your problems begin. Other than process automation, it is also not easy to perform an analytical process across different Excel sheets or repeat a process several times.

Programming in Python takes away these problems. Assuming you need to execute some code to analyze recurrent data, you only need to write a script that would import the new data whenever it is available, parse it, and deliver an analytical report on time. On the other hand, in Excel, you would have to manually create a new file, then key in the desired formulas and functions before proceeding with the analysis.

More importantly, in Excel you can only save your output in the formats Excel supports. In Python, however, you can save the output file in whichever database or file format works for you. This means you do not have to spend extra time on file conversion, which in most cases interferes with the outcome.
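
As a rough sketch of what this kind of automation can look like (the file names below are hypothetical), a few lines of Python with Pandas can read the latest data, summarize it, and write the report in whatever format you prefer every time the script runs.

import pandas as pd

df = pd.read_csv("new_data.csv")      # hypothetical input file
summary = df.describe()               # basic statistics for every numeric column
summary.to_csv("daily_report.csv")    # save the report; any supported format works
print("Report written to daily_report.csv")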

● Economies of scale

Spare some time and study the organization of data in Excel. One feature that stands out clearly is that data is organized in tabs and sheets. This is a prominent feature in Excel, and it works well for processes that are completely reliant on Excel. However, the problem comes in when you have a gigantic database to work with. You might be looking at Excel sheets with huge numbers of entries per sheet, or a workbook that has too many sheets.

Processing such database files will take a lot of time. This creates unnecessary lag in data analysis. Many are the times when your machine will crash, unable to process Excel sheets as fast as you need them to. In such a scenario, your only solution is to be patient and process the files one at a time. This is a challenge that you don’t have to worry about in programming. Languages like Python were specifically built to mitigate such issues. You can process large files in Python faster and more efficiently than you would in Excel.

Besides, it is highly unlikely that your device will give up on you as it would when processing datasets in Excel.

● Ability to regenerate data

In your role as a data analyst, you will need to explain your work to more people than you can imagine. Once you are done with the analysis, you might be asked to prepare a report on your findings, which another department will use to meet their objectives. Beyond that, you might also be required to present the outcome in person, and explain to a panel the decisions you make, and your
recommendations. To meet the objectives outlined above, your data must be reproducible. People who were not part of the analytical process should be able to access the data and understand it just as you do. Here’s where the problem arises when using Excel.

First of all, it is generally impossible for you to provide an elaborate illustration of the procedure and processes leading up to your recommendations. The only way you can walk anyone through your analysis is to get the original file and take them through each step.

Given the haste in which you might have done your work, this might be a challenge. Programming in Python, on the other hand, makes your work easier if you ever need to share it with someone. In some cases, all you need to do is press the OK or Enter button and the analysis will be executed as many times as you need it to. Besides, when analyzing data in Python, you can easily explain each step and have your audience follow through, executing code and seeing the results immediately.

● Debugging

If you are analyzing data in Excel, you will have a difficult time identifying errors. In fact, you have to look for the errors manually. Given a dataset with thousands of cells, this could prove to be a problem. Debugging in Excel is therefore a challenge any data analyst would rather not deal with.

Programming languages like Python make debugging a lot easier. By design, if you enter the wrong syntax you get an error message instead of the expected output. Another good reason for analyzing data in Python is that you can trace the errors in each step. Whenever you key in the wrong functions or syntax, the program will return an error, prompting you to check and sort it out.

In Excel you would probably not know whether you have an error or not, and figuring out the genesis of the problem might force you to start from the beginning, which is more than you could have bargained for.

Since you can include comments in your code, it is easier to trace problems and sort them out. Even if you are not working with data you prepared, you can still read the comments and understand what another programmer did. At the same time, this should not be taken as an assertion that you will fix all the errors you encounter right away. Some errors might take you longer to identify and solve. However, the fact remains that analyzing data in Python gives you an easier and better chance at debugging errors than in Excel.

● Open-source programming

Everything about Excel is in the hands and control of Microsoft. If the program is buggy, you must depend on Microsoft to release patches for bugs. Feature support is also a challenge because unless Microsoft updates their releases, you will have to contend with what is available.

One of the perks of programming in Python is that you are free to enjoy the benefits of open-source programming. You have access to a large community of programmers who are always willing to assist you with any concerns. As you work with some Python code for data analysis, you can improve any of the functions by altering the code accordingly, and share it with the rest of the Python community. There are so many developers who have created or updated some of the packages they use, in the process improving the functionality of the programming language. This has also resulted in better visualizations.

● Advanced operation support

When using Excel, you will struggle when it comes to machine learning and the associated features. This is because Excel was not built for these functionalities. You need advanced programming languages to help you in this regard, hence the need for Python.

In Python, you should also be able to build unique machine learning models. These can be integrated into your code through some of the popular Python frameworks like TensorFlow and Scikit-Learn, thereby enhancing your capabilities when analyzing data.

● Data visualization

You need to see what you are working on. Visualization serves different purposes in data analysis. From the perspective of the analyst, the moment you come across some data, you should easily guess the kind of plot you will use for it. Someone might chime in at this juncture that Excel does offer visualization features. That might be true, but visualization in Excel can be very limited. Python offers you so much more in visualization, especially when you need advanced visualizations. In a business environment, you are called upon to make presentations all the time. Your presentation should be attention-grabbing if it is to convince someone to come onboard.

Each time you are tasked with presenting your report before a panel, remember that most of the people you engage might have no knowledge of data analysis. Therefore, it is impossible for them to read statistical data with the same precision you would. The best way of assisting such individuals would be by plotting some amazing visualizations. A good plot should be one that the audience can make sense of without straining, even if they have no knowledge of statistical computations or data analytics.

It is important to mention that this does not mean you should abandon Excel altogether. Excel, as part of the Microsoft Office suite, has unique features that will come in handy in data handling and management. However, when compared against Python and other programming languages, it still has a long way to go in terms of data analysis. Perhaps one of the perks of Excel is that you can manually enter data into your database, thanks to the GUI. If you are working with a small set of data, you can still scan through it instantly in Excel. Generally, Excel is ideal for the basic data analyst. As you advance in the field, however, you should think outside the box. Advance into Python programming so you can learn to perform better, more accurate, and more complex data analysis without the encumbrances of Excel.

While Python offers these benefits, it is also important to be aware of some of the challenges and limitations you might experience when programming in Python. Our next post will focus on this topic.
Share:

Wednesday, November 20, 2019

Tools Used in Data Analysis

There are several tools you need to learn about to help you in your career as a data analyst. At the basic level, you should at least have a working knowledge of web development, SQL, math, and Microsoft Excel. It also follows that you should be good at PHP, HTML, JavaScript, and know how to work with basic programming commands, libraries, and syntaxes. As an advanced user, you should also be adept in the following fields:

● R Programming

One of the challenges many data analysts experience is choosing the right programming language. Essentially, it is wise to learn as many languages as you can, because you never know what the next project you work on will demand. You might not fully understand all the programming languages, but having working knowledge is a great idea.

While there are lots of programming languages you can choose from, R programming is one that any data analyst should master. It is preferred because it is unique and versatile, particularly when dealing with statistical data. Since R is an open-source platform, you have access to several data analysts who can help you.

R is a simple yet carefully designed language. In R programming, you will use recursive functions, loops, conditionals, and built-in I/O features. R also has data-storage facilities, which is good for data handling as you proceed with your tasks. You will also find the GUI effective, which is ideal for data display.

● Python

The basics of Python programming have been discussed in the earlier books in this series. However, we can recap by highlighting the power behind this open-source programming language. Python is simple, yet it packs quite a punch compared with other programming languages. Programmers and developers alike enjoy coding in Python because of its wide library support, which helps you in data management, manipulation, and analysis. It is one of the easiest languages to learn, especially if you have experience with other languages. The list of projects you can build in Python is endless, and many new projects are still being built today. In terms of existing projects built with Python, think about YouTube.

● Database management

You will be working with lots of data, so data management is a skill you should master or polish up. Some of the tools you must learn include MySQL, MongoDB, MS Access, and SQL Server. These tools are mandatory for data collection, processing, and storage. More importantly, you should understand how to use SQL clauses like ORDER BY, HAVING, GROUP BY, WHERE, FROM, and SELECT.

● MATLAB

MATLAB is another simple, flexible, and powerful programming language that is useful for data analysis. Through MATLAB, you can manipulate and analyze data using its native libraries. Given that the MATLAB syntax is similar to that of C and C++, prior knowledge of these programming languages will help you progress faster in MATLAB.

Over the years, the use of data analysis has become important in different environments. Companies and organizations use data to gain insight into their business performance by studying how their customers interact with their brands at different data collection points. Having understood the basics of data analysis, let’s move on to data analysis with one of the most amazing programming languages, Python.

These are some of the common tools used in data analysis. Although Excel is still widely used in data analysis, nowadays Python is preferred over it. This will be our topic of discussion in the next post.
Share:

Tuesday, November 19, 2019

Types of Data Analysis

Different terminologies are used in data analytics depending on the type of analytics. There is so much data that can be extracted from different sources today. Understanding raw data is quite a challenge given the unpredictable nature of some forms of data. This is where data analysis comes in. Analysis deals with refining raw data into an understandable and actionable form. Here are some of the types of data analysis you will encounter:

● Descriptive Analysis

Descriptive analysis is about summaries. From the data available, you should be able to find summary answers to pertinent issues in the organization, events, or activities. Some of the tools you will use in descriptive analysis include generated narratives, pie charts, bar charts, and line graphs. At a glance, someone should get a summary of the information you present before them.

● Diagnostic Analysis

Think about diagnostic analysis in the same way you see a doctor to provide a diagnosis about your health. More often, you are only aware of the symptoms you are feeling. It is up to the doctor to run tests and rule out possibilities, then narrow down a list of possibilities and tell you what you are suffering from. In a diagnostic analysis, the goal is to use data to explain the unknown. Assuming you are looking at your marketing campaigns on social media, for example, there are so many things you can look at, from mentions, to reviews, to the number of followers and likes. These are features that indicate some activity about your brand. However, it is only through a diagnostic analysis that you can go deeper and unearth what the numbers mean in as far as engagement goes.

● Predictive Analysis

Predictive analysis is one of the common types of analysis in use in organizations today. It uses a combination of statistical algorithms and machine learning to understand data and use this to extrapolate future possibilities from historical data. For accurate predictions, the historical data must be accurate, or the predictions might be flawed.

Predictive analysis is entirely about planning for the future. You use present and historical data to determine what might happen in the future, especially when you alter a few variables that you can control. These studies focus on creating predictive models for new data.

● Exploratory Analysis

Exploratory analysis is about determining trends in your data, and from there explaining features that you might not have been able to determine through other analytical methods. The emphasis is on identifying outliers to understand why and where they occur, and the variables they affect as far as decision-making is concerned.

● Prescriptive Analysis

Many of the forms of analysis you use will give you a general view of your data. A general analysis cannot always give you the kind of information you need. Prescriptive analysis is about precision: the answers you get from this analysis are specific. It is like getting prescription medicine: the doctor recommends specific drugs, to be taken under specific instructions.

Assuming you are looking at data about recent road accidents, through prescriptive analysis you can narrow them down to accidents caused by drunk driving, poor road signage, unroadworthy vehicles, or careless driving.

Here I am ending this post, where we discussed some of the types of data analysis you will encounter. In the next post, our focus will be on Tools Used in Data Analysis.
Share:

Monday, November 18, 2019

Methods Used in Data Analysis

Data analysts are exposed to lots of data from time to time. The challenge is sifting through voluminous data to interpret the ramifications. There are several tools and methods that are used, especially in statistical data analysis.

In a world where big data is coming full circle, there are several tools that can help you reduce your workload, while at the same time improving your efficiency and reliability of the data you use. The methods discussed herein are the foundation of data analysis. Once you master them, it is easier to graduate into sophisticated methods and  techniques:

● Standard deviation

Standard deviation is an expression of how far data points spread from the arithmetic mean. A high value shows a large spread from the mean, while a low value means that most of the data is close to the mean.

Always use standard deviation alongside other techniques to derive conclusive results from your study. On its own, especially with data sets that contain many outliers, standard deviation is not a reliable measure.

● Averages

This refers to the arithmetic mean. You arrive at it by dividing the sum of the n items on your list by n. Averages help you understand the general trend in a specific data set. Calculating averages is very easy, and from this information you can tell a lot about a given data set at a glance.

Even as you use averages, you must be careful not to use them in isolation. On their own, averages can be mistaken for the information you would get from the median and mode. If you are working with data that has a skewed distribution, averages are not the best option because they are not accurate enough to support your decision-making needs.
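
Both the average and the standard deviation are one-line computations in NumPy; here is a small sketch with made-up numbers:

import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])   # invented sample data

print(np.mean(data))     # arithmetic mean
print(np.median(data))   # median, useful alongside the mean
print(np.std(data))      # standard deviation: spread around the mean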

● Regression analysis

Regression analysis is about identifying the relationship between different variables. From these relationships, you will then establish the dependency between the variables. This analysis helps you identify whether relationships between variables are weak or strong.

Regression analysis is usually a good option when you need forecasts to support decision-making. Since it considers the relationship between dependent and independent variables, you can look at the many variables that affect your business in one way or another. The dependent variable in your study is the variable you need to understand. The independent variables are endless, and could represent any factors you are looking at that might affect the dependent variable in some way.
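
A simple linear regression can be sketched with NumPy alone; the data points below are invented.

import numpy as np

x = np.array([1, 2, 3, 4, 5])             # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

slope, intercept = np.polyfit(x, y, 1)    # fit a straight line y = slope * x + intercept
print(slope, intercept)
print(np.corrcoef(x, y)[0, 1])            # correlation coefficient: strength of the relationship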

● Hypothesis testing

This method is also referred to as t-testing. In hypothesis testing, the goal is to test a given assertion to determine whether it is true or not for your study population. This method is popular in many areas that rely on data, such as economics and scientific and business research. There are several errors that you must be aware of if your hypothesis study is to be a success. One common error in hypothesis testing is the Hawthorne effect, also known as the observer effect. In this case, the results of the study do not reflect the true picture because the participants are aware they are under observation. As a result, the results are often skewed and unreliable. Hypothesis testing helps you make decisions after comparing data against hypothetical scenarios concerning your operations. From these decisions, you can tell how certain changes will affect your operations. It is about the correlation between variables.
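
Here is a minimal sketch of a two-sample t-test using SciPy, with invented measurements for a control group and a test group:

import numpy as np
from scipy import stats

control = np.array([12.1, 11.8, 12.4, 12.0, 11.9])
test = np.array([12.9, 13.1, 12.7, 13.3, 12.8])

t_statistic, p_value = stats.ttest_ind(control, test)
print(t_statistic, p_value)   # a small p-value suggests the two groups genuinely differ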

● Determining sample sizes

You need to learn how to select the right sample size for your studies. It is not feasible to collect information from everyone in the study area. Careful selection of your sample size should help you conduct the study effectively. One of the challenges you might experience when choosing the sample size is accuracy. While you are not going to study the entire population of interest, your sample must be randomly selected in a manner that will allow you to get accurate results, without bias.

Here I am ending this post. Make sure you have a good understanding of the methods discussed above before you delve into data analytics. In the next post we'll discuss the Types of Data Analysis.
Share:

Data Analysis Procedure

The data analysis methods discussed in the previous post might be different in their approaches, but the end result is almost always the same. Their core objective is to support decision-making in the organization at different levels. The following are some of the steps that you will follow during data analysis:

● Define the objectives

The objectives behind your study must be clearly outlined. This is the foundation of your study. Everything that you do from here onwards depends on how clearly the objectives of your study were stated. Objectives guide you on how to proceed, the kind of data to look for, and what the data will be used for.

● Ask the right questions

In order to meet the objectives outlined in the first step, you must seek answers to specific questions. This narrows down your focus to the things that matter, instead of going on a wild goose chase with data. Remember that by the time you collect data, the procedure in place should be effective so that you do not end up with a lot of worthless data.

● Collect data

Set up appropriate data collection points. Make sure you use the best statistical method or data collection approach to help you get the correct data for your analysis. You can collect data in different forms, especially for raw data. Once you have the data you need, the hard work begins. Sift the data to weed out inaccurate or irrelevant entries. Use appropriate tools to import and analyze data.

● Analyze data

In this stage, you aggregate and clean data into the different tools you use. From here, you can study the data to determine and define patterns and trends. This is also the stage where most if not all of your questions are answered. You will conduct “what if” analysis in this stage.

● Interpretation and predictive analysis

Having obtained the necessary information from your analysis, the final stage is to infer conclusions from the data. A predictive analysis involves making informed decisions based on the data you have, and leveraging it against some other supporting information. The data from your analysis might be quantitative.

To make a correct decision, for example, you have to consider some qualitative elements, too. You might have the prerequisite numbers, but the general feeling in the market about your business is unfavorable. Making predictions, therefore, is not just about relying on the data you collect and analyze, but an aggregate of other decision processes that are not directly related to the data.
In this stage, you will also look back to the objectives outlined earlier on. Does the data you collect sufficiently answer the questions posed earlier? Suppose there are some objections, do you feel the data available can help you convincingly challenge the objections? Is there something you intentionally ignored, or a limitation to your conclusions? What happens if you introduce an alien factor into the question? Does it affect the output? If so, how?


Share:

Saturday, November 16, 2019

Techniques Used in Data Analysis

Here is an overview of some of the techniques you will come across in data analysis:

● Data visualization

Data visualization is about presentation. You are already aware of most of the tools used in data visualization, such as pivot tables, pie charts, and other statistical tools. Other than presentability, data visualization makes large sets of data easier to understand. Instead of reading tables, for example, you can see the data transposed onto a color-coded pie chart. We are visual creatures. Visuals last longer in our minds than information we read. At a glance, you can understand what the information is about. Summaries are faster and easier through data visualization than through reading raw data. One of the strengths of data visualization is that it helps speed up the decision-making process.

● Business intelligence

Business intelligence is a process where data is converted into actionable information in accordance with the end user’s strategic objectives. While most of the raw data might be difficult to understand or work with, through business intelligence, this data eventually makes sense. Business intelligence techniques help in determining trends, examining them and deducing useful insights.

Many companies use this to help in making decisions about their pricing and product placement strategies. This data is also helpful in identifying new markets for their products and services, and analyzing the sustainability of the said markets. In the long run, this information helps the company come up with specific strategies that help them thrive in each market segment.

● Data mining

Data mining involves studying large sets of data to determine the occurrence of patterns. Patterns help analysts identify trends, and make decisions based on their discoveries. Some of the methods used in data mining include machine learning, artificial intelligence, using databases and statistical computations.

The end result of data mining is the transformation of primitive raw data into credible information that can be used to make informed business decisions. Beyond decision making, data mining can also help uncover the existence and nature of dependencies or abnormalities across different sets of data.

It is also useful in cluster analysis, a procedure where the analyst studies a given set of data to identify the presence of specific data groups. Data mining can be used alongside machine learning to help in understanding consumer behavior. Consumer tastes and preferences are traditionally dynamic. Because of this, changes take place randomly. Given the popularity of e-commerce today, the dynamic shift in consumer tastes and preferences is more volatile than ever.
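As a small sketch of what cluster analysis can look like in practice, the snippet below uses scikit-learn's KMeans to group customer records into segments. The customer figures and the choice of two clusters are illustrative assumptions, not real data:

    import numpy as np
    from sklearn.cluster import KMeans

    # Invented customer records: [annual spend, number of site visits]
    customers = np.array([
        [200,  5], [220,  7], [250,  6],     # low-spend, low-activity shoppers
        [900, 40], [950, 42], [880, 38],     # high-spend, high-activity shoppers
    ])

    # Ask for two clusters and see which segment each customer falls into
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(model.labels_)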

Through data mining, analysts can collect lots of information about consumer actions on their websites, and make an accurate or near-accurate prediction of purchase traits and frequencies. Such information is useful to marketing departments and other allied sectors in the business, to help them create appropriate promotional content to attract and retain more customers.

Marketing-savvy experts usually create niches out of a larger market demographic. The same concept applies to data mining. Through data mining, it is possible to identify groups of data that were previously unidentified. Studying such data groups is important because it allows the analyst to experiment with undefined stimuli and, in the process, possibly discover new frontiers for the marketing departments.

Other than previously unidentified data, data mining is useful when dealing with data sets that are clearly defined. This also involves some element of machine learning. One of the best examples of this is the modern email system. Each mail provider has systems in place that determine spam and non-spam messages. They are then filtered to the right inboxes.

● Text analysis

Most people are unaware of text analysis, especially since it is often viewed as a subset of other data analysis methods. Text analysis is basically reading messages to extract useful information from the content available. Beyond reading texts, the information is processed and passed through specific algorithms to help in decision making.

The nature and process of text analysis depend on the organization and its needs assessment. Information is obtained from different databases or file systems and processed through linguistic analysis. From there, it is easier to determine patterns in the information available by looking at the frequencies of specific keywords. Pattern recognition algorithms usually look for specific targets like email addresses, street names, geographical locations, or phone numbers.
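As a rough sketch of what such keyword counting and pattern recognition can look like, the snippet below counts a few keywords and pulls email addresses and phone numbers out of a piece of text with regular expressions. The sample text and the exact patterns are illustrative assumptions, not a production-grade extractor:

    import re
    from collections import Counter

    # Illustrative sample text; a real system would read from databases or file systems
    text = ("Contact sales@example.com or call 555-0142 for pricing. "
            "Our competitor at 10 Main Street also lists support@example.org.")

    # Frequency of a few keywords of interest
    keywords = ["pricing", "competitor", "lists"]
    counts = Counter(word.strip(".,").lower() for word in text.split())
    print({k: counts[k] for k in keywords})

    # Pattern recognition for email addresses and phone numbers
    emails = re.findall(r"[\w.+-]+@[\w-]+\.\w+", text)
    phones = re.findall(r"\b\d{3}-\d{4}\b", text)
    print(emails, phones)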

Text analysis is commonly applied in marketing, when companies crawl the websites of their competitors to understand how they run their business. They look for specific target words to help them understand why the competitor is performing better or worse than they are. This method can surface competitor keywords and phrases, which the analyst can use to devise a counter-strategy for their company.
Share:

Python in Healthcare

Python has attracted users from many different backgrounds thanks to its advantages in rapid application development and its dynamic build options. Beyond its popularity and functionality, it is also considered to be one of the “safest” programming languages. Python also plays an important role in the healthcare sector.
It may be hard for a layperson to believe that a programming language like Python matters in healthcare, yet there is more than one reason Python is in the healthcare limelight. A number of factors make Python an important asset for anyone doing research in the healthcare field. In this post, let's explore why Python is important in healthcare and why it is considered to be one of the safest languages.
Below are some important reasons for using Python in the healthcare sector:
  1. Python and its frameworks are built on principles that align with the requirements of the HIPAA checklist.
  2. Python's big data capabilities allow healthcare organizations to exchange information in pursuit of better patient outcomes.
  3. Platforms made with Python can be made available on both phones and the web.

How Are Medical Startups Using Python?

  • Roam Analysis

Roam Analysis uses machine learning (artificial intelligence) and comprehensive contextual data to support biopharmaceutical and medical device companies that need to make decisions, suggest treatments, and achieve the best possible patient outcomes.
Roam’s platform is powered by a proprietary data asset called Health Knowledge Graph which is continuously enriched using natural language processing to gather information and make connections with new data.
According to the official description, “Roam’s machine learning and data platform powers rich analysis of patient journeys to reveal the factors affecting treatment decisions and outcomes.”
  • AiCure

AiCure is an NIH- and VC-funded healthcare startup in New York. It automates the process of ensuring that patients take their medicines, and at their assigned times, by combining artificial intelligence with mobile technology.
For example, it uses computer vision to identify the patient (face recognition), verify the right pill for that specific patient (pill recognition), and confirm that the patient actually consumes it (action recognition). The backend applications are developed with the Django framework, written in Python by its research engineers.
Amazed by the magic of Python in healthcare? Here are some more startups:
  • Drchrono

Drchrono is a software-as-a-service patient care platform consisting of a web and cloud-based app. This American company serves doctors and patients, making electronic health records available digitally and providing management and medical billing services. It, too, is a phone- and web-focused platform.
  • Fathom Health

Fathom Health is another healthcare startup, with a deep learning NLP system for reading and understanding electronic health records. Headquartered in San Francisco, California, it is backed by Google. Its employees are familiar with Flask for API programming, and its data engineers prefer Python's NLTK.

Summary

As we saw, Python is not only suitable for general programming and web-based applications but is also helpful in the healthcare sector. Python is under continuous development, which is one of the major reasons it has gained such a strong foothold in healthcare. These applications of Python point towards a better future and the betterment of healthcare. Healthcare is a challenging field, and Python is performing very well in it.
Python's role in healthcare is not limited to this; there are many more applications of Python that will lead towards a better, more high-tech future. Many of us are unaware that Iron Man's Jarvis is said to have been built with Python.
Share:

Thursday, November 14, 2019

Meteoric rise of data science

Marked as one of the highest-paying jobs by Glassdoor, the field of data science has witnessed immense growth in recent years. Employers are searching for data scientists more than ever. A report by Indeed indicated a 29% increase in demand for data scientists in a single year. However, the number of people skilled in data science grew at a slower pace, rising by only 14%. The gap between demand and supply has increased the market value of a data scientist.

The rise of data science is prevalent not only in the realm of software but also in fields such as marketing, education and manufacturing. The internet acted as the catalyst in the very beginning: actions taken by internet users were trackable, which led to enormous amounts of stored data. This further encouraged research in which computer scientists engaged in real-time analytics and deciphered how people used certain products.

One other reason which helped data science gain limelight was the rise in the related fields of Artificial Intelligence and Machine Learning.

The demand for data professionals is increasing due to the rising popularity of data-driven decision making. Where people once used Excel to work with data, tools like Hadoop have now secured a place for managing Big Data. With frequent advances in technology, several other tools are finding meaningful use in helping organizations make impactful decisions. The usage of a few is listed below:

  • Tools like Python and R have witnessed exceptional improvements, allowing users to solve complex problems with only a few lines of code.
  • Google Analytics is another effective tool for the marketing department.
  • Tools like Tableau, Microsoft Power BI and Sisense have found relevance in the business intelligence departments for the purpose of data visualization.

Apart from the uses mentioned above, data science has proven able to solve many complex real-world problems. A few examples of real-world problems that have found answers in modern data-driven solutions are:

  • Advances in the field of data science have made it easier to detect fraud and abuse in insurance firms. Credit card fraud detection systems work on similar grounds, protecting customers and minimizing losses due to fraud.
  • Automated piloting: the concept behind self-driving vehicles runs on data science. Still in its nascent stage, this will change the functioning of the automobile industry entirely.
  • Prediction of short-term (local) and long-term (global) weather.

In addition to these, social networking sites like Facebook generate ad revenue by showing content according to users' preferences, utilizing user data to personalize feeds. Amazon's ad recommendation system functions on a similar basis: it stores data on the products consumers search for and displays relevant ads on its website to attract customers. Google, meanwhile, has redefined the data ecosystem by making use of data in every domain; from its search engine to YouTube to advertisements, everything runs on data.

A pinch of sugar in the ocean, these examples aren't sufficient to describe everything the vast field of data science incorporates. Even though the rise of data science isn't sudden, the potential in this field is here to stay. Where until recently only large enterprises were willing to invest in data scientists, now almost every firm is. The growth is tremendous, which naturally has a considerable effect on the prospects of individuals skilled in this discipline.

There are multiple factors involved in the meteoric rise of data science. First, the amount of data being collected keeps growing at an exponential rate. According to recent market research from the IBM Marketing Cloud (https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=WRL12345GBEN), something like 2.5 quintillion bytes are created every day (to give you an idea of how big that is, that's 2.5 billion billion bytes), yet only a tiny fraction of this data is ever analyzed, leaving tons of missed opportunities on the table.

Second, we're in the midst of a cognitive revolution that started a few years ago; almost every industry is jumping on the AI bandwagon, which includes natural language processing (NLP) and machine learning. Even though these fields have existed for a long time, they have recently enjoyed renewed attention, to the point that they are now among the most popular courses in colleges and attract the lion's share of open source activity. It is clear that, if they are to survive, companies need to become more agile, move faster, and transform into digital businesses, and as the time available for decision-making shrinks to near real-time, they must become fully data-driven. If you also include the fact that AI algorithms need high-quality data (and a lot of it) to work properly, you can start to understand the critical role played by data scientists.

Third, with advances in cloud technologies and the development of Platform as a Service (PaaS), access to massive compute engines and storage has never been easier or cheaper. Running big data workloads, once the purview of large corporations, is now available to smaller organizations or any individual with a credit card; this, in turn, is fueling the growth of innovation across the board.

For these reasons, there is no doubt that, similar to the AI revolution, data science is here to stay and that its growth will continue for a long time. But we also can't ignore the fact that data science hasn't yet realized its full potential and produced the expected results, in particular helping companies in their transformation into data-driven organizations. Most often, the challenge is achieving that next step, which is to transform data science and analytics into a core business activity that ultimately enables clear-sighted, intelligent, bet-the-business decisions.


Share:

Wednesday, November 13, 2019

Data science and its future

Data science refers to the activity of analyzing a large amount of data in order to extract knowledge and insight leading to actionable decisions.

Now you might ask what kind of knowledge, insight, and actionable decision are we talking about?

To orient the conversation, let's reduce the scope to three fields of data science:

• Descriptive analytics: Data science is associated with information retrieval and data collection techniques with the goal of reconstituting past events to identify patterns and find insights that help understand what happened and what caused it to happen. An example of this is looking at sales figures and demographics by region to categorize customer preferences. This part requires being familiar with statistics and data visualization techniques.

• Predictive analytics: Data science is a way to predict the likelihood that some events are currently happening or will happen in the future. In this scenario, the data scientist looks at past data to find explanatory variables and build statistical models that can be applied to other data points for which we're trying to predict the outcome, for example, predicting the likelihood that a credit card transaction is fraudulent in real-time. This part is usually associated with the field of machine learning.

• Prescriptive analytics: In this scenario, data science is seen as a way to make better decisions, or perhaps I should say data-driven decisions. The idea is to look at multiple options and, using simulation techniques, quantify and maximize the outcome, for example, optimizing the supply chain by looking at minimizing operating costs.

In essence, descriptive data science answers the question of what (does the data tell me), predictive data science answers the question of why (is the data behaving a certain way), and prescriptive data science answers the question of how (do we optimize the data toward a specific goal).
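To make the descriptive case concrete, here is a minimal pandas sketch that aggregates sales figures by region and customer segment, which is essentially the "what happened" question. The column names and numbers are invented for illustration:

    import pandas as pd

    # Invented sales records, purely for illustration
    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South", "West"],
        "segment": ["Retail", "Online", "Retail", "Online", "Retail"],
        "revenue": [120, 95, 80, 140, 60],
    })

    # Descriptive analytics: summarize what happened, per region and segment
    summary = sales.groupby(["region", "segment"])["revenue"].sum()
    print(summary)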

Now another question that usually comes to mind is whether data science is here to stay.

In the last decade, we've seen exponential growth in interest in data science, both in academia and in industry, to the point where it became clear that the old model of doing analytics would not be sustainable. As data analytics plays a bigger and bigger role in a company's operational processes, the developer's role has expanded to get closer to the algorithms and to build the infrastructure that runs them in production. Another piece of evidence that data science has become the new gold rush is the extraordinary growth of data scientist jobs, which have been ranked number one for two years in a row on Glassdoor (https://www.prnewswire.com/news-releases/glassdoor-revealsthe-50-best-jobs-in-america-for-2017-300395188.html) and are consistently among the positions posted most by employers on Indeed.

Headhunters are also on the prowl on LinkedIn and other social media platforms, sending tons of recruiting messages to whoever has a profile showing any data science skills. One of the main reasons behind all the investment being made into these new technologies is the hope that they will yield major improvements and greater efficiencies in the business. However, even though it is a growing field, data science in the enterprise today is still confined to experimentation instead of being a core activity, as one would expect given all the hype. This has led a lot of people to wonder whether data science is a passing fad that will eventually subside, or yet another technology bubble that will eventually pop, leaving a lot of people behind.


These are all good points, but people quickly realized that data science was more than just a passing fad; more and more of the projects they were leading involved integrating data analytics into core product features. Finally, it was when the IBM Watson question answering system won a game of Jeopardy! against two experienced champions that people became convinced that data science, along with the cloud, big data, and Artificial Intelligence (AI), was here to stay and would eventually change the way we think about computer science.
Share:

Hacking with Python

By now you have a basic idea of how Python works and how programs are created using this programming language. Now, you are ready to learn how you can use Python scripts to compromise websites, networks, and more.

Learning how to hack entails being able to set up the right environment to work in, so that you can develop your own exploitation tools. Since you have already installed Python and the standard library that comes with it, you are pretty much set up for hacking. All you need to do now is install the other tools and libraries you will use for your exploits.

Third-party libraries are libraries that do not come bundled with your installation of Python. All you need to do to get one is download it from its source, uncompress the package you just downloaded, and then change into the target directory.

As you might have already guessed, third-party libraries are extremely useful when it comes to developing your own tools out of resources that have already been created by someone else. Since Python is a highly collaborative programming language, you can take libraries that you find on sites such as GitHub or the Python website and incorporate them into your code.
Once you are inside the directory, you can install the downloaded package using the command python setup.py install. Take a look at this example to see how it is done:
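A minimal sketch of those steps for the python-nmap package would look roughly like this; the archive filename and version number are assumptions for illustration:

    # After downloading the python-nmap source archive (filename is illustrative):
    tar -xzf python-nmap-0.6.1.tar.gz      # uncompress the downloaded package
    cd python-nmap-0.6.1                   # change into the target directory
    python setup.py install                # install the library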



What happened here is that you downloaded and installed the python-nmap package, which will allow you to parse nmap results.

Now let's write a password cracker program, which will help us understand how hacking is performed. This Python program will not only teach you how to crack passwords, but also help you learn how to embed a library in your code and get the results you want.

To write this password cracker, you will need the crypt() algorithm, which lets you hash passwords in the UNIX format. When you launch the Python interpreter, you will see that the crypt library you need for this code is already in the standard library. To compute the encrypted hash of a UNIX password, all you need to do is call the function crypt.crypt() and pass the password and salt as parameters. The call returns a string containing the hashed password. Here is how it should be done:
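A minimal sketch of that call, assuming a Unix system where the standard-library crypt module is available; the password and salt values here are placeholders:

    import crypt

    # Placeholder values; any password and two-character salt will do
    password = "secret"
    salt = "AB"
    hashed = crypt.crypt(password, salt)   # returns the hashed password as a string
    print(hashed)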


Now, you can try hashing a target's password with crypt(). Once you have imported the necessary library, pass the salt “HX” and the password “egg” to the function. When you run the code, you will get back the hashed password string “HX9LLTdc/jiDE”. This is how the output should look:
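In an interactive interpreter session on a Unix machine, that would look roughly like this:

    >>> import crypt
    >>> crypt.crypt("egg", "HX")
    'HX9LLTdc/jiDE'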


With that in place, you can simply write a program that iterates through an entire dictionary, trying each word until one of them yields the word used for the password.

Now, you will need to create two functions for the program you are going to write: testPass and main. The main function opens the file that contains the encrypted passwords, password.txt, and reads all of its lines. It then splits each line into the username and its corresponding hashed password, and calls the testPass function to test each hashed password against the dictionary.

The testPass function takes the still-encrypted password as a parameter and returns either when it has exhausted the words available in the dictionary or when it has successfully cracked the password. This is how the program will look:
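Here is a minimal sketch of that program, assuming a Unix system with the standard-library crypt module, a password.txt file holding one “username:hashedpassword” entry per line, and a wordlist file named dictionary.txt (the wordlist filename is an assumption):

    import crypt

    def testPass(cryptPass):
        # Classic UNIX crypt stores the salt in the first two characters of the hash
        salt = cryptPass[0:2]
        with open("dictionary.txt") as dictFile:
            for word in dictFile:
                word = word.strip()
                if crypt.crypt(word, salt) == cryptPass:
                    print("[+] Found password: " + word)
                    return
        print("[-] Password not found.")

    def main():
        # password.txt holds "username:hashedpassword" lines
        with open("password.txt") as passFile:
            for line in passFile:
                if ":" in line:
                    user = line.split(":")[0]
                    cryptPass = line.split(":")[1].strip()
                    print("[*] Cracking password for: " + user)
                    testPass(cryptPass)

    if __name__ == "__main__":
        main()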




When you run this code, you will see output along these lines:
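Assuming password.txt contained entries for the users ‘victim’ and ‘root’, the output would look something like this (the exact lines are illustrative):

    [*] Cracking password for: victim
    [+] Found password: egg
    [*] Cracking password for: root
    [-] Password not found.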



Judging from these results, you can deduce that the password for the username ‘victim’ is in the dictionary you have available. However, the password for the username ‘root’ is not. This means that the administrator's password on the system you are trying to exploit is more sophisticated, but it could well be contained in a different type of dictionary.

At this point, you are able to set up an ideal Python hacking environment and make use of resources made available by other hackers. Now that you have created your first hacking tool, it's time for you to discover how you can write your own hacking scripts!
Share: