$ nohup wget "https://data.iowa.gov/api/views/ykb6-ywnd/rows.csv?accessType=DOWNLOAD" -O store.csv &
$ nohup wget "https://data.iowa.gov/api/views/gckp-fe7r/rows.csv?accessType=DOWNLOAD" -O product.csv &
$ nohup wget "https://data.iowa.gov/api/views/m3tr-qhgy/rows.csv?accessType=DOWNLOAD" -O iowaliquor.csv &
$ sed -E "s#([0-9]{2})/([0-9]{2})/([0-9]{4})#\3-\1-\2#" < iowaliquor.csv | tr -d '$' > iowa-liquor-datefixed.csv
$ sed -E "s#([0-9]{2})/([0-9]{2})/([0-9]{4})#\3-\1-\2#" < store.csv | tr -d '$' > store-datefixed.csv
$ sed -E "s#([0-9]{2})/([0-9]{2})/([0-9]{4})#\3-\1-\2#" < product.csv | tr -d '$' > product-datefixed1.csv
$ sed -E "s#([0-9]{2})/([0-9]{2})/([0-9]{4})#\3-\1-\2#" < product-datefixed1.csv | tr -d '$' > product-datefixed.csv
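For readers less familiar with sed, the date rewrite above can be sketched in Python. This is only an illustration, not part of the repo: the function name `fix_line` and the sample row are made up. Because the sed expression has no `g` flag, it rewrites only the first MM/DD/YYYY date on each line, which is presumably why product.csv is passed through it twice; `count=1` mirrors that behaviour.

```python
import re

# Same pattern as the sed expression: MM/DD/YYYY
DATE = re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{4})")

def fix_line(line: str) -> str:
    """Rewrite the first MM/DD/YYYY date as YYYY-MM-DD and drop dollar signs."""
    # count=1 mirrors sed without the g flag: only the first match is replaced
    line = DATE.sub(r"\3-\1-\2", line, count=1)
    # Same effect as `tr -d '$'`
    return line.replace("$", "")

# Hypothetical sample row, not taken from the real dataset
sample = "S001,11/20/2015,$162.84"
print(fix_line(sample))  # S001,2015-11-20,162.84
```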
$ spark-submit ./src/Data_Collecting/dataclean.py <sale|product|store> <inputs> <output>
where “sale | product | store” selects the sale, product, or store table, “inputs” is the path of that table in csv format, and “output” is the path of Hadoop/S3 where you want the outputs to be stored. The output is the data file after cleaning.
This is the Kafka producer.
$ nohup ./src/Data_Collecting/check_update.py &
This is the Kafka consumer.
$ nohup ./src/Data_Collecting/apply_update.py <output> &
where “output” is the path of the location where the newly collected data will be written. The loop stops after a timeout, which is there for testing and debugging only and can be removed when the script is put into use.
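The timeout-bounded loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in apply_update.py: a standard-library `queue.Queue` stands in for the Kafka consumer, and the names `drain_with_timeout`, `timeout_s`, and `poll_s` are hypothetical. Passing `timeout_s=None` corresponds to removing the timeout.

```python
import queue
import time

def drain_with_timeout(source, apply, timeout_s=None, poll_s=0.1):
    """Apply each message from `source` until `timeout_s` seconds have passed.

    timeout_s=None means loop until interrupted (no test timeout).
    """
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    handled = 0
    while deadline is None or time.monotonic() < deadline:
        try:
            message = source.get(timeout=poll_s)
        except queue.Empty:
            continue  # nothing new yet; keep polling
        apply(message)
        handled += 1
    return handled

# Stand-in for messages published by the producer (made-up values)
updates = queue.Queue()
for row in ("row-1", "row-2"):
    updates.put(row)

applied = []
count = drain_with_timeout(updates, applied.append, timeout_s=0.5)
```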
$ spark-submit ./src/Overview_Sale_By_Month/total_sales_by_month.py <inputs> <output>
where “inputs” is the path of the sales table, and “output” is the path of the location where you want the outputs to be stored. The output is the sale data aggregated by month, which can be visualized and can also be fed into train_pred.py.
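The aggregation itself amounts to grouping sales by the year-month part of the date and summing. The sketch below shows that idea in plain Python rather than Spark; the row layout (an ISO date plus a sale amount) and the sample values are assumptions made for illustration, not the actual schema of the sales table.

```python
from collections import defaultdict

def total_sales_by_month(rows):
    """Sum sale amounts per "YYYY-MM" month.

    rows: iterable of (iso_date, amount) pairs, e.g. ("2015-11-20", 100.5).
    """
    totals = defaultdict(float)
    for iso_date, amount in rows:
        month = iso_date[:7]  # keep only the "YYYY-MM" prefix
        totals[month] += amount
    return dict(sorted(totals.items()))

# Made-up rows for illustration
rows = [("2015-11-20", 100.5), ("2015-11-21", 99.5), ("2015-12-01", 50.0)]
monthly = total_sales_by_month(rows)
```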
$ spark-submit ./src/Overview_Sale_By_Month/train_pred.py <inputs> <modelfile> <output>
where “inputs” is the output of total_sales_by_month.py, “modelfile” is the path of the location where you want the model files to be stored, and “output” is the path of the location where the predicted sales will be stored.
$ spark-submit ./src/Q1_Growth_Rate/variance.py <inputs> <output>
where “inputs” is the path of a folder that contains three subfolders: sale, product, and store. In this repo’s case, it’s the ./cleaned_data folder. “output” is the path of the location where the output files will be stored. The output includes:
$ spark-submit ./src/Q2_RFM_Cluster/RFM.py <saleData file> <output>
where “saleData file” is the path of the sale table in parquet format, and “output” is the path of the location where you want the outputs to be stored. The output is the store number with its RFM segmentation and score, combined with the cluster predicted by the k-means algorithm.
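RFM stands for recency, frequency, and monetary value. As a rough sketch of what those three numbers mean per store (not the repo's Spark code, and without the segmentation or k-means steps), the example below computes days since the last sale, number of sales, and total sales. The function name `rfm`, the tuple layout, and the sample rows are assumptions for illustration.

```python
from datetime import date

def rfm(sales, today):
    """Return {store: (recency_days, frequency, monetary)}.

    sales: iterable of (store, sale_date, amount) tuples.
    """
    table = {}
    for store, day, amount in sales:
        last, freq, money = table.get(store, (day, 0, 0.0))
        table[store] = (max(last, day), freq + 1, money + amount)
    # Recency is measured in days between the last sale and `today`
    return {s: ((today - last).days, f, m) for s, (last, f, m) in table.items()}

# Made-up rows for illustration
sales = [
    (2633, date(2015, 11, 20), 100.0),
    (2633, date(2015, 12, 1), 50.0),
    (4829, date(2015, 10, 5), 75.0),
]
scores = rfm(sales, today=date(2016, 1, 1))
```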
$ spark-submit ./src/Q2_RFM_Cluster/joinGeoRfm.py <storeData file> <RFM file> <output>
where “storeData file” is the path of the store table in parquet format, “RFM file” is the path of the RFM file in csv format, and “output” is the path of Hadoop/S3 where you want the outputs to be stored. The output is the RFM file combined with the geographic coordinates.
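Conceptually this step is a join on the store number. The sketch below shows that join in plain Python under some assumptions: the field names (`store`, `lat`, `lon`, `cluster`) and the sample values are hypothetical, and an inner join is assumed, so stores without coordinates are dropped. The actual script may handle them differently.

```python
def join_geo(rfm_rows, store_rows):
    """Attach coordinates to each RFM row by store number (inner join)."""
    coords = {s["store"]: (s["lat"], s["lon"]) for s in store_rows}
    joined = []
    for row in rfm_rows:
        if row["store"] in coords:  # skip stores with no known location
            lat, lon = coords[row["store"]]
            joined.append({**row, "lat": lat, "lon": lon})
    return joined

# Made-up rows for illustration
rfm_rows = [{"store": 2633, "cluster": 1}, {"store": 9999, "cluster": 0}]
store_rows = [{"store": 2633, "lat": 41.6, "lon": -93.6}]
geo_rfm = join_geo(rfm_rows, store_rows)
```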
$ nohup ./src/Q2_RFM_Cluster/DrawMap.py <inputs> &
where “inputs” is the path of the RFM model with GEO in csv format.
$ spark-submit ./src/Q3_Optimization_problem/optimization_problem.py <input_1> <input_2>
where “input_1” is the path to the (normalized) folder that contains the sales data (in parquet files), i.e. “./cleaned_data/sale”, and “input_2” is the path to the folder that contains the product data, i.e. “./cleaned_data/product”. The output images will be generated in the folder “./Q3_Optimization_problem/”.