Data cleaning and effective data screening

Aug 13 2024

Before conducting data analysis, operators need to clean the collected data sets, otherwise the results of data analysis will be inaccurate due to the presence of outliers.

Taking the data in the “Product Portrait” Excel file as an example, operators can use the “Filter” function of Excel to check whether there are outliers in the data of each dimension. The operation steps are: open the “Product Portrait” Excel table, select the first row of data, select the “Home” tab in the menu bar, and then click the “Sort and Filter” drop-down button, and select the “Filter” option in the pop-up drop-down menu.

When the Excel operation interface generates the “Filter” function symbol (a small downward arrow), the operator can click the “Filter” symbol of each dimension data to check whether there are outliers in the dimension data. Taking “Number of Ratings” as an example, after clicking the “Filter” symbol, the operator will find that there is an outlier value of “-1” in the dimension data. This type of outlier is caused by the Python crawler script failing to complete the crawling when crawling data.

Generally speaking, outliers within 5% of the total data volume will not affect the accuracy of data analysis. When the operator screens the outliers of “number of ratings” through the above steps, it can be found that there are 23 outliers in total (Excel shows 24 rows, but because the first row is the data dimension name, the total outliers are 24-1=23).

The operator can repeat the above steps to delete the outliers of each dimension one by one. It should be noted that when processing outliers in the ranking dimension, only the “Clothing, Shoes & Jewelry” dimension needs to be processed, and the “Women’s Shops” dimension does not need to be processed. This is because the latter is a small category ranking compared to the former, and some products belong to the large category but not the small category. Therefore, when performing preliminary data cleaning, you only need to pay attention to the data of the large category ranking.

After cleaning the abnormal data in the five dimensions of “Number of Ratings”, “Exposure Price”, “Product Price Range (Lowest Price)”, “Product Price Range (Highest Price)”, and “Clothing, Shoes & Jewelry”, the final data can be obtained and the cleaned data can be saved in an Excel file named “Cleaned Product Portrait Data”.

Click a column in the data table at random, and you can find that the data set contains 18,660 ASIN data. Compared with the original data of 19,152, a total of 492 ASINs with abnormal values have been deleted. After completing a series of data cleaning steps, you can start data analysis.

Posted inBlog

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Data cleaning and effective data screening

Leave a Reply Cancel reply