I'm writing some ETL flows in Python that, for part of the process, use Hive. Cloudera's impyla client, according to the documentation, works with both Impala and Hive.
In my experience, the client worked for Impala, but hung when I tried to connect to Hive:
from impala.dbapi import connect
conn = connect(host='host_running_hs2_service', port=10000, user='awoolford', password='Bzzzzz')
cursor = conn.cursor()  # <- hangs here
cursor.execute('show tables')
results = cursor.fetchall()
print results
If I step into the code, it hangs when it tries to open a session (line 873 of hiveserver2.py).
At first, I suspected that a firewall might be blocking the port, so I tried to connect using Java. To my surprise, this worked:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class Main {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        // Load the Hive JDBC driver; exit if it is not on the classpath.
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection connection = DriverManager.getConnection("jdbc:hive2://host_running_hs2_service:10000/default", "awoolford", "Bzzzzz");
        Statement statement = connection.createStatement();
        ResultSet resultSet = statement.executeQuery("SHOW TABLES");
        // Print the name of each table in the default database.
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}
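The firewall theory can also be checked from Python itself with a plain TCP connect that bypasses both clients. This is only a minimal sketch: the helper name `port_is_open` and the 5-second default timeout are my own choices, not part of impyla, and it uses nothing but the standard `socket` module.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests TCP reachability and
    says nothing about whether the Thrift/SASL handshake will succeed.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.timeout, socket.error):
        # Refused, unreachable, unresolvable, or timed out.
        return False
    sock.close()
    return True
```

If `port_is_open('host_running_hs2_service', 10000)` returns True, as the working Java connection suggests it should, then the port is reachable and the hang must happen somewhere above the TCP level.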
Since Hive and Python are such commonly used technologies, I'm curious to know whether anyone else has experienced this problem and, if so, how you fixed it.
Versions:
- Hive 1.1.0-cdh5.5.1
- Python 2.7.11 | Anaconda 2.3.0
- Red Hat 6.7