INTRODUCTION TO DATA WRANGLING

Ikponmwosa Esther
Jun 30, 2020


Data wrangling, sometimes called data munging, is a very important part of data science. I’m going to illustrate it in the simplest way I can. Say you have rented a new house. Before you move in, you first have to clean it up, get rid of anything unwanted inside and make sure it is tidy. Data wrangling is similar: when we are given messy data, the first thing we do is clean it so that it is easy to analyse.

In this post, I’ll introduce a few things about data wrangling to help us understand it better. They include:

⦁ What is data wrangling

⦁ Importance of data wrangling

⦁ The core activities of data wrangling

⦁ Conclusion

WHAT IS DATA WRANGLING

According to Wikipedia, data wrangling is the process of transforming and mapping data from one raw data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

From this definition, we can say that data wrangling is the act of data preparation. It includes a set of tasks that need to be done to better understand and prepare your data.

IMPORTANCE OF DATA WRANGLING

The importance of data wrangling cannot be overlooked. In general, it prepares the data for use. What I mean is that data are discovered, structured, cleaned, enriched, validated, stored and published.

CORE ACTIVITIES OF DATA WRANGLING

The activities listed above can be done using pandas. Pandas is a library built on NumPy and is one of the most popular Python libraries for data wrangling.

⦁ Importing or exporting data: data are stored in different formats (CSV, JSON, XML, Excel, SQL, etc.). This step is about acquiring the data, loading it from the files it is stored in into a dataframe, and discovering it (getting to know your data in terms of patterns and correlations).

⦁ Data manipulation: this includes sorting, merging, grouping and altering the data. Data manipulation is the next step after acquiring and loading the data. We sort the data in ascending or descending order, merge or concatenate the dataframes that need to be combined, group them, and rename any column that needs to be renamed.

⦁ Data cleaning: this is the act of cleaning the data by handling missing values, duplicates, outliers and unwanted data. If these are not properly handled, they will affect the accuracy of the result.

⦁ Data analysis: this is where we check what new types of data can be derived from the data we already have, consider what additional information would improve our decision making, and validate the data to make sure they are correct and make sense. Validation rules are repetitive programming sequences that verify data consistency, quality, and security.

⦁ Data storage: after doing all of the above, the next thing is to save the data so that it can be published.
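
To make the activities above a bit more concrete, here is a minimal pandas sketch that walks through them on a tiny made-up dataset. Everything specific in it is an assumption for illustration only: the sample orders, the column names, the `CITIES` lookup table and the `wrangle` function are hypothetical, and the cleaning rules (drop negative amounts, fill missing amounts with the median) are just one possible choice, not the only right one.

```python
import io

import pandas as pd

# Hypothetical raw data standing in for a CSV file on disk.
# It contains a duplicate row, missing amounts and a negative amount.
RAW_CSV = """order_id,customer,amount
1,Ada,120.0
2,Ben,
2,Ben,
3,Chi,95.5
4,Ada,-10.0
"""

# Hypothetical second table, used only to demonstrate merging.
CITIES = pd.DataFrame(
    {"customer": ["Ada", "Ben", "Chi"], "city": ["Lagos", "Abuja", "Benin"]}
)


def wrangle(raw_csv: str) -> pd.DataFrame:
    # 1. Importing: load the raw data into a dataframe.
    df = pd.read_csv(io.StringIO(raw_csv))

    # 2. Cleaning: remove duplicate rows, drop unwanted rows
    #    (negative amounts, an assumed rule), then fill missing
    #    amounts with the median of the remaining values.
    df = df.drop_duplicates()
    df = df[~(df["amount"] < 0)]  # rows with a missing amount are kept here
    df = df.assign(amount=df["amount"].fillna(df["amount"].median()))

    # 3. Manipulation: rename a column, merge in the city table,
    #    and sort in descending order of amount.
    df = df.rename(columns={"amount": "order_amount"})
    df = df.merge(CITIES, on="customer", how="left")
    df = df.sort_values("order_amount", ascending=False).reset_index(drop=True)

    # 4. Validation: simple checks that the result makes sense.
    assert df["order_id"].is_unique
    assert df["order_amount"].notna().all()
    return df


clean = wrangle(RAW_CSV)

# Grouping: total order amount per customer.
totals = clean.groupby("customer")["order_amount"].sum()

# 5. Storage: export the cleaned data as CSV text.
#    Passing a file path to to_csv would write it to disk instead.
csv_out = clean.to_csv(index=False)
print(clean)
```

On real data you would replace the in-memory string with the path to your own file and adjust the cleaning and validation rules to whatever makes sense for that dataset.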

CONCLUSION

Now we can see why data wrangling is so important in data science and should not be overlooked. Without clean and robust data, there is no Data Science.
