You can reference this question where an OP was asking something similar. If I am understanding your problem correctly, you want to remove duplicates from the path, but only when they occur next to each other. So 1 -> 1 -> 2 -> 1 would become 1 -> 2 -> 1. If this is correct, then you can't just group and distinct (as I'm sure you have noticed), because that removes all duplicates regardless of whether they are adjacent. An easy solution is to write a UDF that removes those duplicates while preserving the distinct path of the user.
UDF:
package something;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveSequentialDuplicatesUDF extends UDF {
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        // Guard against null or empty input so arr.get(0) can't throw
        if (arr == null || arr.isEmpty()) {
            return arr;
        }
        ArrayList<Text> newList = new ArrayList<Text>();
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back = arr.get(i - 1).toString();
            // Keep an element only if it differs from its predecessor
            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}
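If you want to sanity-check the logic locally before deploying to Hive, a minimal sketch like the one below exercises evaluate() directly. The RemoveSequentialDuplicatesUDFTest class and the sample screen names are hypothetical, and it assumes hadoop-core is on your classpath for the Text type:

package something;

import java.util.ArrayList;

import org.apache.hadoop.io.Text;

// Hypothetical local driver, not part of the UDF itself
public class RemoveSequentialDuplicatesUDFTest {
    public static void main(String[] args) {
        ArrayList<Text> path = new ArrayList<Text>();
        for (String screen : new String[] {"s1", "s1", "s2", "s1"}) {
            path.add(new Text(screen));
        }
        // Prints [s1, s2, s1]: the back-to-back s1 collapses,
        // but the later s1 is preserved
        System.out.println(new RemoveSequentialDuplicatesUDF().evaluate(path));
    }
}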
To build this jar you will need hive-core.jar and hadoop-core.jar, both of which you can find in the Maven Repository. Make sure you get the versions of Hive and Hadoop that you are using in your environment. Also, if you plan to run this in a production environment, I'd suggest adding some exception handling to the UDF. After the jar is built, add it and run this query:
Query:
add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/jars/hive_common-SNAPSHOT.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";
create temporary function remove_dups as "something.RemoveSequentialDuplicatesUDF";

select screen_flow
     , count
     , dense_rank() over (order by count desc) rank
from (
    select screen_flow
         , count(*) count
    from (
        select session_id
             , concat_ws("->", remove_dups(screen_array)) screen_flow
        from (
            select session_id
                 , collect(screen_name) screen_array
            from (
                select *
                from database.table
                order by screen_launch_time
            ) a
            group by session_id
        ) b
    ) c
    group by screen_flow
) d;
Output:
s1->s2->s3        2    1
s1->s2            1    2
s1->s2->s3->s1    1    2
Hope this helps.