Big Data Case Study: E-commerce Data Analysis using Hadoop

Published on 18 September 2020 12:51 AM

Data matters to every organization today, because more data usually means more business. The volume is so large, however, that no single storage appliance can hold it, and however much hardware a company buys, it soon needs more. So how do large multinationals such as Google, Facebook, Instagram and Amazon store their data indefinitely? Facebook alone generates 4 PB (petabytes) of data daily, as reported by Kinsta as of 6 Nov 2020. That is why people say that big data knows everything. The problem is not only storing the data but also performing operations on it quickly, because today's world is agile and every second matters.

What is Big Data?

No storage vendor offers a single device that can hold this much data, and even if one existed, its IOPS (input/output operations per second) limits would make reading and writing far too slow. The problem of having more data than a single storage system can hold and serve is therefore called Big Data.

Big data poses two problems, described by two Vs: Volume and Velocity. Volume is the size of the data being generated and stored, while velocity is the speed at which that data is processed to produce output. Both problems are addressed by building distributed storage clusters in a master-slave architecture: the data is spread across many systems, and parallel processing speeds up both writing the data to the servers and processing it. Cloud computing makes it easy to provision such distributed storage clusters.
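The idea of spreading data across systems and processing the pieces in parallel can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads inside one process stand in for separate machines, and the function names (partition, count_orders, distributed_count) are made up for this example rather than taken from any big data tool.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Deal the records out to `num_nodes` partitions, round-robin."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def count_orders(part):
    """Work done locally on one (simulated) node: count the records it holds."""
    return len(part)


def distributed_count(records, num_nodes=4):
    """Count records by combining the partial counts from every partition."""
    parts = partition(records, num_nodes)
    # Each partition is handled by its own worker; threads stand in for nodes.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_counts = list(pool.map(count_orders, parts))
    return sum(partial_counts)


total = distributed_count(list(range(1000)), num_nodes=4)
```

The master's job here is only to split the work and add up the partial results, which is the same division of labour a master-slave cluster uses at a much larger scale.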

What is E-commerce?

E-commerce is the general term for online shopping, whether through a website or an app, where products are listed and you can select and buy whatever you want, hassle-free, from anywhere in the world. You log in with your credentials, choose a mode of payment and enter your details, and the product is delivered to your doorstep. Examples include Amazon and Flipkart.

Need for Data Analysis

E-commerce companies want to maintain both product quality and sales, and to keep reducing the number of fraudulent activities on their platforms, by investigating and analyzing customer behaviour. These companies receive a lot of data about products, registered users, and the behaviour of users in terms of placing orders and the subsequent actions taken on those orders. In this data, different products belong to different categories, and every product has a different discount and profit percentage. Users from different locations also behave differently, and on this basis e-commerce websites try to detect the purchase patterns of users and find possible fraudulent activities.

Analysis is always done on historical data, i.e. past data that the organization has already collected. Based on your past purchases or buying behaviour, a company can recommend similar items from the same category that carry a good discount at that time. Likewise, if many customers repeatedly buy certain items together, the platform will recommend buying those items together when you try to buy one of them. This increases the chance of more sales, and more sales mean more profit for the company.

What is Hadoop?

Hadoop is an open-source framework from the Apache Software Foundation and one of the most widely used big data tools for creating and managing distributed storage clusters where data is stored and processed. Hadoop stores data in HDFS (Hadoop Distributed File System), in which a master node (the NameNode) keeps track of the data held on the worker nodes (DataNodes) and manages the cluster. Any big data project requires a proper setup, and Hadoop helps us create it by connecting all the nodes to each other in a master-slave architecture. It stores the data in fixed-size blocks across the nodes and keeps multiple replicas of each block, so that even if a node goes down or crashes, reads and running operations on the data are not affected.
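To make the block-and-replica idea concrete, here is a small simulation in Python. It is a sketch, not how Hadoop is implemented: the block size is in bytes instead of HDFS's default 128 MB, replicas are placed round-robin (real HDFS placement is rack-aware), and the function and node names are hypothetical.

```python
BLOCK_SIZE = 128   # bytes here for illustration; HDFS defaults to 128 MB
REPLICATION = 3    # HDFS's default replication factor


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin.

    This is a simplification: real HDFS placement also considers racks.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """Blocks that still have at least one replica when one node is down."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


blocks = split_into_blocks(b"x" * 300)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
still_readable = readable_blocks(placement, failed_node="node2")
```

Because every block lives on three different nodes, losing node2 leaves all three blocks readable from the remaining replicas.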

What type of data is collected?

  • Product's Info: This contains all the information about the products listed on the e-commerce platform. The data includes product id, name, seller, category, selling price, discount, profit percent of e-commerce platform etc.

  • User's Activity: This contains the different activities performed by a user, i.e. the details of every order the user has placed, including cancelled and returned orders. The data includes product id, name of user, user id, whether the order was cancelled or returned, the reason for cancellation or return, date of order, date of shipment, date of delivery, date of cancellation, date of return etc.

  • User's Info: This contains all the information about the users registered on the e-commerce platform. This data includes user id, name, location, age, occupation, details of orders bought or cancelled or returned by the user etc.
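The three kinds of data listed above could be modelled roughly as follows. The class and field names are assumptions chosen for this sketch, and so is the rule that a "valid" purchase is one that was neither cancelled nor returned; the actual schema used by any given platform will differ.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Product:
    product_id: str
    name: str
    seller: str
    category: str
    selling_price: float
    discount_pct: float
    platform_profit_pct: float  # profit percent earned by the platform


@dataclass
class User:
    user_id: str
    name: str
    location: str
    age: int
    occupation: str


@dataclass
class OrderActivity:
    product_id: str
    user_id: str
    user_name: str
    order_date: date
    shipment_date: Optional[date] = None
    delivery_date: Optional[date] = None
    cancellation_date: Optional[date] = None
    cancellation_reason: Optional[str] = None
    return_date: Optional[date] = None
    return_reason: Optional[str] = None

    @property
    def is_valid(self) -> bool:
        """Assumed rule: valid means neither cancelled nor returned."""
        return self.cancellation_date is None and self.return_date is None


kept = OrderActivity("p1", "u1", "Asha", date(2020, 9, 1))
returned = OrderActivity(
    "p2", "u1", "Asha", date(2020, 9, 1),
    return_date=date(2020, 9, 10), return_reason="damaged",
)
```

Keeping the cancellation and return dates optional lets a single record type cover delivered, cancelled and returned orders.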

Data Analysis

All this data is stored in databases. The data is then cleaned: duplicate records are removed and null values are fixed so that further operations can run on it. Analysis is then performed on the collected data using various tools and techniques. Some of the outputs these platforms look for from the analysis are as follows:
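The cleaning step described above can be sketched in plain Python as follows. The function name clean_records, the column names and the default values are illustrative assumptions; on a real cluster this would typically run as a distributed job rather than a single loop.

```python
def clean_records(records, defaults):
    """Fill missing values with per-column defaults, then drop exact duplicates.

    `records` is a list of dicts; `defaults` maps a column name to the value
    used when that column is None or an empty string.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Replace missing values (None or "") with the column's default.
        fixed = {
            col: (defaults.get(col) if val in (None, "") else val)
            for col, val in rec.items()
        }
        # A record is a duplicate if all of its column/value pairs match.
        key = tuple(sorted(fixed.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(fixed)
    return cleaned


raw = [
    {"user_id": 1, "location": "Delhi"},
    {"user_id": 1, "location": "Delhi"},   # exact duplicate
    {"user_id": 2, "location": None},      # null value
]
cleaned = clean_records(raw, defaults={"location": "unknown"})
```

Nulls are filled before duplicates are checked, so two records that differ only by a missing value in one of them are treated as the same record once the default is applied.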

Purchase Pattern Detection

  • What is the most purchased category for every user? Identify the users with the most valid purchases so that more products from the same category can be recommended to them.
  • Which products generate the maximum profit? The platform can then focus on selling more of those products, and other products in the same category, to earn more.
  • Which sellers generate the maximum profit? The platform can then focus on them by giving them more offers in order to retain them for a longer period.
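The purchase pattern questions above boil down to grouping and aggregating order records. A minimal sketch in Python follows; the record fields, the sample orders and the helper names are made up for illustration, and profit is assumed to be price multiplied by the platform's profit percent on valid (not cancelled, not returned) orders.

```python
from collections import Counter, defaultdict


def is_valid(order):
    """Assumed rule: a valid purchase was neither cancelled nor returned."""
    return not (order["cancelled"] or order["returned"])


def top_category_per_user(orders):
    """Most purchased category for every user, counting valid orders only."""
    counts = defaultdict(Counter)
    for o in filter(is_valid, orders):
        counts[o["user_id"]][o["category"]] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


def profit_by(orders, key):
    """Total profit grouped by `key` (e.g. product or seller), highest first."""
    totals = defaultdict(float)
    for o in filter(is_valid, orders):
        totals[o[key]] += o["price"] * o["profit_pct"] / 100
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def order(user, product, seller, category, price, pct, cancelled=False):
    """Build one hypothetical order record."""
    return {"user_id": user, "product_id": product, "seller": seller,
            "category": category, "price": price, "profit_pct": pct,
            "cancelled": cancelled, "returned": False}


orders = [
    order("u1", "p1", "s1", "books", 200.0, 10.0),
    order("u1", "p2", "s1", "books", 100.0, 10.0),
    order("u1", "p3", "s2", "toys", 1000.0, 5.0),
    order("u2", "p3", "s2", "toys", 1000.0, 5.0, cancelled=True),
]
favourites = top_category_per_user(orders)
product_profit = profit_by(orders, "product_id")
seller_profit = profit_by(orders, "seller")
```

The same group-then-aggregate shape maps naturally onto a distributed job, where each node aggregates its own share of the orders and the partial totals are merged afterwards.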

Fraud Detection

  • Which user or seller has the most returns? What valid purchases were made by that user or from that seller?
  • Which location has the most cancellations? Which location has the most returns?
  • Which users' purchase patterns have changed by more than 50% compared with last month, with respect to their top 3 purchases?
  • What is the total value of cancelled or returned orders in every city, broken down by reason?
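The fraud detection questions above can be sketched the same way. The post does not define how a change in purchase pattern is measured, so the code below makes an assumption: it takes a user's top 3 purchased categories from last month and reports the fraction that no longer appear in this month's top 3. The field names and sample orders are also hypothetical.

```python
from collections import Counter, defaultdict


def count_by(orders, status, key):
    """Count orders with the given status, grouped by `key` (user, city, ...)."""
    return Counter(o[key] for o in orders if o["status"] == status)


def value_by_city_and_reason(orders):
    """Total price of cancelled or returned orders per (city, status, reason)."""
    totals = defaultdict(float)
    for o in orders:
        if o["status"] in ("cancelled", "returned"):
            totals[(o["city"], o["status"], o["reason"])] += o["price"]
    return dict(totals)


def top_n(purchases, n=3):
    """The n most frequently purchased items, as a set."""
    return {item for item, _ in Counter(purchases).most_common(n)}


def pattern_change(last_month, this_month, n=3):
    """Fraction of last month's top-n items missing from this month's top-n.

    Assumed metric: 0.0 means the top purchases are unchanged, 1.0 means
    none of them carried over.
    """
    before = top_n(last_month, n)
    if not before:
        return 0.0
    return 1 - len(before & top_n(this_month, n)) / len(before)


orders = [
    {"user_id": "u1", "city": "Delhi", "status": "returned",
     "reason": "damaged", "price": 500.0},
    {"user_id": "u1", "city": "Delhi", "status": "returned",
     "reason": "damaged", "price": 250.0},
    {"user_id": "u2", "city": "Pune", "status": "cancelled",
     "reason": "late delivery", "price": 100.0},
    {"user_id": "u2", "city": "Pune", "status": "delivered",
     "reason": None, "price": 900.0},
]
most_returns = count_by(orders, "returned", "user_id").most_common(1)
lost_value = value_by_city_and_reason(orders)
change = pattern_change(["books", "books", "toys", "shoes"],
                        ["phones", "laptops", "books"])
suspicious = change > 0.5
```

In this sample only one of the user's three top categories carries over, so the change is about 67% and the user would be flagged for review under the 50% threshold.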

Conclusion

Based on the above results, e-commerce companies try to reduce fraud and increase sales, so that they can retain customers as well as sellers for longer and earn their loyalty, which helps them build a reputable name in the world. The data generated is so large that distributed big data technologies such as Hadoop are needed to handle it with fast response times.