Connect and retrieve data from databases
Kotlin Notebook offers capabilities for connecting to and retrieving data from various types of SQL databases, such as MariaDB, PostgreSQL, MySQL, and SQLite. Utilizing the Kotlin DataFrame library, Kotlin Notebook can establish connections to databases, execute SQL queries, and import the results for further operations.
For a detailed example, see the Notebook in the KotlinDataFrame SQL Examples GitHub repository.
Before you start
Download and install the latest version of IntelliJ IDEA Ultimate.
Install the Kotlin Notebook plugin in IntelliJ IDEA.
Create a new Kotlin Notebook by selecting File | New | Kotlin Notebook.
Ensure you have access to an SQL database, such as MariaDB or MySQL.
Connect to database
You can connect to and interact with an SQL database using specific functions from the Kotlin DataFrame library. You can use DatabaseConfiguration
to establish a connection to your database and getSchemaForAllSqlTables()
to retrieve the schema of all tables within it.
Let's look at an example:
Open your Kotlin Notebook file (
.ipynb
).Add a dependency for a JDBC (Java Database Connectivity) driver, and specify the JDBC driver version. This example uses MariaDB:
USE { dependencies("org.mariadb.jdbc:mariadb-java-client:$version") }Import the Kotlin DataFrame library, which is essential for data manipulation tasks, along with the necessary Java libraries for SQL connectivity and utility functions:
%use dataframe import java.sql.DriverManager import java.util.*Use the
DatabaseConfiguration
class to define your database's connection parameters, including the URL, username, and password:val URL = "YOUR_URL" val USER_NAME = "YOUR_USERNAME" val PASSWORD = "YOUR_PASSWORD" val dbConfig = DatabaseConfiguration(URL, USER_NAME, PASSWORD)Once connected, use the
getSchemaForAllSqlTables()
function to fetch and display the schema information for each table in the database:val dataschemas = DataFrame.getSchemaForAllSqlTables(dbConfig) dataschemas.forEach { println("---Yet another table schema---") println(it) println() }
Retrieve and manipulate data
After establishing a connection to an SQL database, you can retrieve and manipulate data in Kotlin Notebook, utilizing the Kotlin DataFrame library. You can use the readSqlTable()
function to retrieve data. To manipulate data, you can use methods, such as filter
, groupBy
, and convert
.
Let's look at an example of connecting to an IMDB database and retrieving data about movies directed by Quentin Tarantino:
Use the
readSqlTable()
function to retrieve data from the "movies" table, settinglimit
to restrict the query to the first 100 records for efficiency:val dfs = DataFrame.readSqlTable(dbConfig, tableName = "movies", limit = 100)Use an SQL query to retrieve a specific dataset related to movies directed by Quentin Tarantino. This query selects movie details and combines genres for each movie:
val props = Properties() props.setProperty("user", USER_NAME) props.setProperty("password", PASSWORD) val TARANTINO_FILMS_SQL_QUERY = """ SELECT name, year, rank, GROUP_CONCAT(genre) as "genres" FROM movies JOIN movies_directors ON movie_id = movies.id JOIN directors ON directors.id=director_id LEFT JOIN movies_genres ON movies.id = movies_genres.movie_id WHERE directors.first_name = "Quentin" AND directors.last_name = "Tarantino" GROUP BY name, year, rank ORDER BY year """ // Retrieves a list of Quentin Tarantino's movies, including their name, year, rank, and a concatenated string of all genres. // The results are grouped by name, year, rank, and sorted by year. var dfTarantinoMovies: DataFrame<*> DriverManager.getConnection(URL, props).use { connection -> connection.createStatement().use { st -> st.executeQuery(TARANTINO_FILMS_SQL_QUERY).use { rs -> val dfTarantinoFilmsSchema = DataFrame.getSchemaForResultSet(rs, connection) dfTarantinoFilmsSchema.print() dfTarantinoMovies = DataFrame.readResultSet(rs, connection) dfTarantinoMovies } } }After fetching the Tarantino movies dataset, you can further manipulate and filter the data.
val df = dfTarantinoMovies // Replaces any missing values in the 'year' column with 0. .fillNA { year }.with { 0 } // Converts the 'year' column to integers. .convert { year }.toInt() // Filters the data to include only movies released after the year 2000. .filter { year > 2000 } df
The resulting output is a DataFrame where missing values in the year column are replaced with 0 using the fillNA
method. The year column is converted to integer values with the convert
method, and the data is filtered to include only rows from the year 2000 onwards using the filter
method.
Analyze data in Kotlin Notebook
After establishing a connection to an SQL database, you can use Kotlin Notebook for in-depth data analysis utilizing the Kotlin DataFrame library. This includes functions for grouping, sorting, and aggregating data, helping you to uncover and understand patterns within your data.
Let's dive into an example that involves analyzing actor data from a movie database, focusing on the most frequently occurring first names of actors:
Extract data from the "actors" table using the
readSqlTable()
function:val actorDf = DataFrame.readSqlTable(dbConfig, "actors", 10000)Process the retrieved data to identify the top 20 most common actor first names. This analysis involves several DataFrame methods:
val top20ActorNames = actorDf // Groups the data by the first_name column to organize it based on actor first names. .groupBy { first_name } // Counts the occurrences of each unique first name, providing a frequency distribution. .count() // Sorts the results in descending order of count to identify the most common names. .sortByDesc("count") // Selects the top 20 most frequent names for analysis. .take(20) top20ActorNames
What's next
Explore data visualization using the Kandy library.
Find additional information about data visualization in Data visualization in Kotlin Notebook with Kandy.
For an extensive overview of tools and resources available for data science and analysis in Kotlin, see Kotlin and Java libraries for data analysis.