How to find path flows and rank them using Pig or Hive?

2024/9/20 19:49:30

Below is the example for my use case.

[Two example images, not preserved]

Answer

You can reference this question, where the OP asked something similar. If I understand your problem correctly, you want to remove duplicates from the path, but only when they occur next to each other, so 1 -> 1 -> 2 -> 1 would become 1 -> 2 -> 1. If that is correct, then you can't just group and distinct (as I'm sure you have noticed), because that removes all duplicates, not only the adjacent ones. An easy solution is to write a UDF that removes the adjacent duplicates while preserving the order of the user's path.
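
To make the rule concrete without any Hive or Hadoop dependencies, here is a minimal plain-Java sketch of the same idea. The class and method names (SequentialDedupSketch, collapseAdjacent) are hypothetical and exist only for illustration; it works on plain strings rather than Hadoop Text values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper for illustration; not part of Hive or Brickhouse.
class SequentialDedupSketch {

    // Keeps an element unless it equals the element kept immediately before it.
    static List<String> collapseAdjacent(List<String> path) {
        List<String> result = new ArrayList<>();
        for (String step : path) {
            if (result.isEmpty() || !Objects.equals(result.get(result.size() - 1), step)) {
                result.add(step);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> path = Arrays.asList("1", "1", "2", "1");
        // Prints: 1 -> 2 -> 1
        System.out.println(String.join(" -> ", collapseAdjacent(path)));
    }
}
```

Note that the trailing 1 survives because it is not adjacent to the earlier 1s, which is exactly what a plain distinct would lose.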

UDF:

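package something;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveSequentialDuplicatesUDF extends UDF {

    // Returns a copy of arr in which each run of equal adjacent values is collapsed to one.
    public ArrayList<Text> evaluate(ArrayList<Text> arr) {
        ArrayList<Text> newList = new ArrayList<Text>();
        // A missing or empty array has no path; return an empty list instead of failing on get(0).
        if (arr == null || arr.isEmpty()) {
            return newList;
        }
        newList.add(arr.get(0));
        for (int i = 1; i < arr.size(); i++) {
            String front = arr.get(i).toString();
            String back = arr.get(i - 1).toString();
            // Keep the element only if it differs from its predecessor.
            if (!back.equals(front)) {
                newList.add(arr.get(i));
            }
        }
        return newList;
    }
}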

To build this jar you will need hive-core.jar and hadoop-core.jar; you can find these in the Maven Repository. Make sure you get the versions of Hive and Hadoop that match your environment (in many distributions the UDF base class ships in hive-exec and Text in hadoop-common, so check which artifact actually contains the classes imported above). Also, if you plan to run this in a production environment, I'd suggest adding some exception handling to the UDF. After the jar is built, add it to your session and run this query:

Query:

add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/jars/hive_common-SNAPSHOT.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";
create temporary function remove_dups as "something.RemoveSequentialDuplicatesUDF";

select screen_flow, count, dense_rank() over (order by count desc) rank
from (
  -- count how many sessions produced each flow
  select screen_flow, count(*) count
  from (
    -- collapse adjacent duplicates and join the screens into one string
    select session_id, concat_ws("->", remove_dups(screen_array)) screen_flow
    from (
      -- gather each session's screens into an array
      select session_id, collect(screen_name) screen_array
      from (
        select *
        from database.table
        order by screen_launch_time
      ) a
      group by session_id
    ) b
  ) c
  group by screen_flow
) d

Output:

s1->s2->s3      2       1
s1->s2          1       2
s1->s2->s3->s1  1       2
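
If you want to check the counting and ranking step outside of Hive, the following plain-Java sketch mimics the outer two levels of the query (count(*) grouped by screen_flow, then dense_rank() ordered by count descending). The class name FlowRankSketch and the four input flows are assumptions, chosen only so that the result matches the output above; they are not taken from the original table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; it does not call Hive.
class FlowRankSketch {

    // Returns rows formatted as "flow<TAB>count<TAB>rank", ordered by count descending.
    static List<String> rankFlows(List<String> flows) {
        // Equivalent of: select screen_flow, count(*) ... group by screen_flow
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String flow : flows) {
            counts.merge(flow, 1, Integer::sum);
        }

        // Stable sort by count descending; ties keep first-seen order.
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        // Dense rank: the rank increases by one only when the count changes.
        List<String> out = new ArrayList<>();
        int rank = 0;
        Integer previousCount = null;
        for (Map.Entry<String, Integer> row : rows) {
            if (!row.getValue().equals(previousCount)) {
                rank++;
                previousCount = row.getValue();
            }
            out.add(row.getKey() + "\t" + row.getValue() + "\t" + rank);
        }
        return out;
    }

    public static void main(String[] args) {
        // One already-collapsed flow per session (made-up sample data).
        List<String> flows = Arrays.asList(
                "s1->s2->s3", "s1->s2", "s1->s2->s3", "s1->s2->s3->s1");
        for (String row : rankFlows(flows)) {
            System.out.println(row);
        }
    }
}
```

Because the rank is dense, the two flows that each occur once share rank 2, and no rank is skipped after a tie.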

Hope this helps.
