Saturday, November 23, 2019

Shortcomings of Analyzing Data in Python

Given all the buzz about data and data analysis, it might come as a surprise to many people that data analysis has unique challenges of its own, challenges that can stand between a project and its expected deliverables. One of the biggest challenges data analysts have to work through is the fact that most of the data they rely on is collected at the level of individual users.

Because of this, there is room for many errors, which eventually affect the credibility of the data and of the reports derived from it. Whether in marketing or any other department that relies on data, the unpredictability of user-level data means some data will be relevant to some applications and projects, but not all the time. This creates the dilemma of deciding when to use data and when to discard it, or alternatively keeping the data and updating it over time.

For all the benefits Python brings to data work, it is also important to be aware of some of the challenges and limitations you might experience when programming in it. This way, you know what you are getting into and, more importantly, you come prepared. Below we discuss some of the challenges that arise for data analysts when they have to work with this kind of data.

● Input bias

One of a data analyst’s biggest concerns is the reliability of the data at their disposal. Most of the data they have access to, especially at collection points like a company’s online ads, is not 100% reliable. Contrary to what is expected, this input does not usually present a true picture of how customers interact with the brand. Today there are several ways data analysts can try to obtain credible and accurate information about customers, one of which is tracking cookies. Cookies do yield some data, but the accuracy of that data will always come under scrutiny.

Think about a common scenario where you own several devices, each of which you can use to go online and look up your favorite brand. In this example, it is not easy to determine the point at which a sale was made. All the devices belong to you, but you could have made the purchase decision on one of them and completed the purchase on another. This level of fragmentation makes it difficult to track customer data effectively. Data collected from customers who own several devices is unlikely to be accurate, so there is always a risk of basing an analysis on inaccurate data.

● Speed

By now, it is no secret that Python is slower than the majority of programming languages; C++, for example, executes comparable code much faster. Because of this, you might need to supplement the speed of your applications. Many developers adopt an alternative runtime, such as PyPy, that is more efficient than the standard CPython interpreter for their workload.

In data analysis, speed is something you cannot take for granted, especially when you are working with a lot of time-sensitive data. Being aware of the speed challenges you might encounter in Python should help you plan your work accordingly and set realistic deliverables.
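
To make the gap concrete, here is a minimal sketch you can run yourself: it times a plain Python loop against the built-in sum(), which is implemented in C. The exact numbers depend on your machine, but the loop is typically several times slower.

    import time

    data = list(range(10_000_000))

    # Pure-Python loop: every iteration goes through the interpreter.
    start = time.perf_counter()
    total = 0
    for x in data:
        total += x
    loop_time = time.perf_counter() - start

    # The built-in sum() does the same arithmetic in C under the hood.
    start = time.perf_counter()
    total = sum(data)
    builtin_time = time.perf_counter() - start

    print(f"loop: {loop_time:.2f}s  built-in sum: {builtin_time:.2f}s")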

● Version compatibility

If there is one persistent nuisance you will experience in Python, it is version compatibility. Many programmers dismiss it as a mundane issue, but its ramifications are extensive. For beginner data analysts, one early hurdle is settling on the right Python version to learn, which is not easy when you know a newer version already exists.

By default, many programmers still treat Python 2 as the base version, but for forward-looking data analysis, Python 3 is your best bet. Generally, you will receive updates to either version whenever they are available. When it comes to actually running code, however, challenges can arise. Many programmers and data analysts still prefer the second version over the current one, because some common libraries and framework packages only support Python 2.
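
To see why the split matters in practice, consider two behaviors that changed between the major versions. The short snippet below uses Python 3 syntax; the comments note how Python 2 treats the same code.

    import sys

    print(sys.version_info)  # confirms which interpreter is running

    # Division changed between the versions:
    # Python 2: 7 / 2  evaluates to 3   (floor division)
    # Python 3: 7 / 2  evaluates to 3.5 (true division; 7 // 2 gives 3)
    print(7 / 2, 7 // 2)

    # print itself also changed: a statement in Python 2 (print 'hi')
    # became a function in Python 3 (print('hi')).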

● Porting applications

Python is a high-level language, so an interpreter must convert your written code into instructions the operating system can execute. To run an application, you therefore need the correct interpreter version installed on the target computer. This becomes a problem when you try to port an application to a different platform, and even when porting is possible, the process hardly ever goes smoothly.
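
One common defensive pattern is to check the interpreter version at startup and fail with a clear message, rather than crash with a confusing syntax or import error on a machine running the wrong interpreter. A minimal sketch:

    import sys

    # Refuse to run on an interpreter older than what the app needs.
    if sys.version_info < (3, 6):
        sys.exit("This application requires Python 3.6 or newer.")

    print("Interpreter check passed:", sys.version.split()[0])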

● Lack of independence

For all the good that can be done with Python, it is not a self-contained programming language. Python depends on third-party libraries, packages, and frameworks to let you analyze data properly, whereas many other languages on the market ship with most of these capabilities already bundled in. Any programmer interested in analyzing data in Python must make peace with the fact that they will have to use additional libraries and frameworks. This brings its own challenges, because the only practical way out is to bring in open-source dependencies; without them, legacy dependencies would consume a lot of resources, increasing the cost of the analysis project.
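
The point is easy to demonstrate: even a trivial analysis script starts by importing packages that are not part of the standard library and must be installed separately (for example, with pip install numpy pandas). A minimal illustration:

    # None of these ship with Python itself:
    import numpy as np   # numerical arrays
    import pandas as pd  # tabular data

    df = pd.DataFrame({"clicks": [12, 30, 7], "sales": [1, 4, 0]})
    print(df.describe())  # summary statistics in a few lines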

● Algorithm-based analysis

There are two accepted methods data analysts use to study and interpret a given set of data. The first is to analyze a sample and draw conclusions about the population from that sample. Because this approach covers only a sample, the data might not truly represent the greater population. Samples can easily be biased, which makes it difficult to recover the true picture.

The second approach is to use algorithms to derive information about the population. Running algorithms is often the better method because an algorithm can process the entire population rather than a slice of it. However, algorithms do not always provide the most important contextual answers. Either method can yield actionable recommendations, but neither can tell you why customers behave a certain way.

Data without this context can be unreliable, because the same numbers could mean any of several things. For the average user, reports from algorithms will rarely answer their most pressing questions.
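
The sampling risk in the first method is easy to simulate. The sketch below uses made-up numbers: it draws a simple random sample from a stand-in population and compares the sample mean to the true mean, which any single sample can miss.

    import random

    population = list(range(10_000))         # stand-in for user records
    sample = random.sample(population, 100)  # simple random sample

    sample_mean = sum(sample) / len(sample)
    true_mean = sum(population) / len(population)

    # A biased or unlucky sample shows up as a gap between these two:
    print(f"sample mean: {sample_mean:.1f}  population mean: {true_mean:.1f}")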

● Runtime errors

One of the reasons Python is popular is its dynamism; its syntax is about as close to ordinary human language as a programming language gets. As a result, you do not have to declare a variable or its type before assigning and using it, so you can write code without the ceremony you would face in languages like C#.

However, even as you enjoy this easy coding, you may hit errors only when the code actually runs. The problem arises because Python does not enforce stringent rules for declaring variables and their types, so mistakes that a compiler would catch in a stricter language surface at runtime instead. With this in mind, you must run a series of tests while coding to identify these errors and fix them, a process that can cost you a lot of time and money.
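
Here is a hypothetical example of the kind of mistake Python only surfaces when the code runs. A statically typed language would reject the call at compile time; Python happily executes it and returns a wrong answer:

    def total_cost(price, quantity):
        return price * quantity

    # price arrives as a string (say, straight from a CSV file).
    # Python does not complain -- string * int means repetition:
    print(total_cost("19.99", 3))  # prints '19.9919.9919.99', not 59.97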

● Outlier risks

In data analysis, you will come across outliers from time to time. Outliers will have you questioning the credibility of your data, especially when you are using raw user data. If a single outlier can cast doubt on the viability of the dataset, imagine the effect of several outliers.

More often than not, you will come across weird outliers that are not easy to interpret. For example, you might be looking at data about your website, only to notice an unexplained spike in views during a two-hour window. When something like this happens, how can you tell whether the spike represents genuine individual visits, a single user whose system was compromised, or some other unprecedented burst of traffic?

Such data will certainly affect your results if you include it in your analysis. The important questions are: how do you incorporate this data, and how can you tell what caused the spike in traffic? Is it a random event, or part of a trend you have observed over time? You might have to clean the data to ensure you capture correct results. But what if, by cleaning the data, you assume the spike was an erroneous outlier when the data you are discarding was in fact legitimate?
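
Flagging a suspicious spike is the easy part; the snippet below does it with nothing but the standard library and made-up traffic counts. Deciding whether the flagged value is a bot, an attack, or a legitimate surge is the judgment call described above.

    import statistics

    hourly_views = [120, 115, 130, 980, 125, 118]  # hypothetical counts

    mean = statistics.mean(hourly_views)
    stdev = statistics.stdev(hourly_views)

    # Flag anything more than two standard deviations from the mean.
    outliers = [v for v in hourly_views if abs(v - mean) > 2 * stdev]
    print(outliers)  # [980] -- but is it an error or a real surge?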

● Data transfer restrictions

Data is at the heart of everything today. For this reason, it is imperative that companies do all they can to protect the data at their disposal. If you factor in the stringent data protection laws like the GDPR, protecting databases is a core concern in any organization.

Data analysts might need to share data or discuss it with their peers, but doing so is heavily restricted: access to specific data must be protected, so moving it across servers or from one device to another is rarely permitted. Delve further into big data, and most organizations do not have enough employees with the prerequisite skills to handle such data efficiently. As a result, data administrators must restrict the number of people who can interact with it.
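
When sharing does have to happen, one common mitigation (not a full GDPR solution) is to pseudonymize direct identifiers before a dataset leaves the protected environment. A minimal sketch, with a hypothetical salt that in practice would be stored securely:

    import hashlib

    SALT = b"example-salt"  # hypothetical; keep the real one secret

    def pseudonymize(user_id: str) -> str:
        # Salted hash: stable within the dataset, hard to reverse.
        return hashlib.sha256(SALT + user_id.encode()).hexdigest()[:12]

    print(pseudonymize("user-4821"))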

In light of these restrictions, most of the analysis, and the recommendations that flow from it, is the prerogative of the data analyst. There is hardly ever a second opinion, because very few people have comparable access to the database. This also creates a problem where users or members of the organization cannot follow up on the findings: they do not know the procedures or assumptions the analyst used to arrive at their conclusions. In the long run, the data can be validated only by the one or few people in the organization who took part in the analysis, which kills the collaborative approach exactly where it should thrive.


Python does, however, enjoy amazing library support. In the next post we shall discuss Python libraries for data analysis.