3_datamen_CMPT_732_project

CMPT 732 Big Data Project Report

Problem Definition:

What is the problem that you are trying to solve? What are the challenges of this problem?

When any business owner decides to open a new store(or expanding an existing business) at new city or a new neighbourhood, It can be challenging for business owners to decide on the best location/products for their stores so that they can make more profits.

Also, when a Vendor salesmen approach business owners asking them to keep/sell their products. A good analysis is needed to check how these products are doing in the market for that city.

Analysis on above questions is necessary to take decisions such as the optimal location for their business, what products they will sell, and how to increase sales when starting a new or expanding an existing business in a new city.

Therefore, by examining the past sales data, our initiative intends to provide analysis on above mentioned questions to inform them with knowledge that can assist them in making profitable decisions.

Now, we plan to test the this strategy on the “Liquor sales” dataset presented by Iowa Govt.

Also, the above strategy can be extended to provide these analysis to stores in other domains. For example, if Safeway plans to open a new store in a new city, we can provide them with answers to above questions and help them maximize their profits.

Methodology:

What is the problem that you are trying to solve? Briefly explain which tool(s)/technique(s) were used for which task and why you chose to implement that way.

As we mentioned, we used “IOWA Liquor sales data” published by IOWA Government. Now, after obtaining the data, the next task was to clean it, extract-transform-load it into a Data Warehouse and then eventually into a Data Mart, and then generate reports like grouping the most popular drink by region. And the final stage i.e. to visualize the data and make it presentable to the clients.

  1. Brief on Data Acquiring and Cleaning

    The Iowa Alcoholic Beverages Division receives reports on wholesale transactions, and they make the data (from 2012 to the present) accessible at: https://data.iowa.gov/.

    The iowa-liquor dataset is a 5.7 GB file. Through analysis, we found that this table contains store information, product information and transaction information. Among them, there are a lot of redundant and wrong stores data, such as incomplete names, multiple names for the same store, etc. In addition, the geographical location is stored as coordinate point data instead of latitude and longitude, and there are errors in the date format.

    In order to facilitate analysis and remove redundant and erroneous data, we used the official store table and product table to replace the original data(by dropping the columns from original Table). In this way, while proofreading the data, it also normalized the data. After solving all the mentioned problems, the data normalizes into three tables namely sale, store and product. Each table can be updated individually via Kafka in real-time. It is convenient for subsequent analysis and mining.

    Check More Details here

  2. Overview of Historical Sales Data and Its Trend

    Before we dive into any deeper analysis, we would like to have an overview of the historical sales data, and see whether there is a particular pattern of how the total sale changes.

    By using Spark we aggregated the sales data by year and month, and we found there obviously is a pattern: the amount of sale changes in similar way within each year, and the overall trend is steadily increasing.

    Check More Details here

  3. Some Questions We Intend to Answer

    Question 1.

    (a) Market can be segmented in different ways, for example: by cities, mid-to-high-end liquors, liquor categories, and so on; sales growth rate of market segments under each of these scenarios may have different distributions. We would like to explore which of the scenarios can divide the market most effectively, that is, to reveal the variantion of different market segments, so that vendors can know what to take into consideration when they want to expand their business

    (b) Display growth rate distributions under the Top 5 segmentation scenarios

    Question 2.

    (a) Use the RFM method to model merchants and generate features based on the results.

    (b) Use the kmeans method to classify merchants and use geographic coordinates to observe whether the classification results are geographically meaningful

    Question 3.

    Assuming that the capital at hand is certain, how can we try to maximize the profit?

    So, if the business owners have limited capital and limited inventory space which products would be the best ones to invest in or buy and keep in inventory. I.e to find the products that give maximum profits, at the same time, keeping in mind that the inventory space is limited. Therefore, optimizing both the constraints.

Proposed Solution Design:

Alt text

Problems:

What problems did you encounter while attacking the problem? How did you solve them?

Results:

What are the outcomes of the project? What did you learn from the data analysis? What did you learn from the implementation?

Project Summary:

A summary of what you did to guide our marking.

Total: 20

Project Contributors: Song Lin (sla475, 301550328), Shlok Nangia (sna101, 301550257), Jialiang Guo (jga169, 301558771)