gcubed.data.database

This module contains the Database class. A database can be rebased to different years, and it provides access to variable values for historical years.

class Database(gcubed.base.Base):

Overview

The base class, gcubed.base.Base, provides convenience methods for all classes.

All G-Cubed classes inherit from that base class.

Database(sym_data: gcubed.sym_data.SymData)

Overview

The database class is used directly but it is also subclassed to support specific data usage scenarios.

It encapsulates all of the information about the database of values for all variables across a range of years.

Arguments

sym_data: The data about the model, created by the SYM processor. This also provides access to the model configuration.

The SYM processor output

The model configuration

data: pandas.core.frame.DataFrame

The data itself, contained in a dataframe with columns indexed by 4 digit (YYYY) year strings and with the rows indexed by variable names.

variables_count: int

The number of variables in the database.

years_count: int

The number of years in the database.

years_column_names: pandas.core.indexes.base.Index

The year column names for the data.

base_year: int

The (YYYY) format base year for the data. All indexes in the database are based in the specified year. Databases (but not database subclasses) can be rebased to different years.

first_available_year: int

The first year of data in the database.

last_available_year: int

The last year of data in the database.
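The properties above can be pictured with a small stand-in dataframe. This is only a sketch of the documented layout (year-string columns, variable-name rows); the variable names and values are made up for illustration and are not taken from a G-Cubed database.

```python
import pandas as pd

# Hypothetical variable names and values, for illustration only.
data = pd.DataFrame(
    {"2017": [1.00, 0.95], "2018": [1.02, 0.97], "2019": [1.05, 1.00]},
    index=["GDPR(USA)", "PRID(USA)"],
)

variables_count = data.shape[0]    # number of rows (variables)
years_count = data.shape[1]        # number of year columns
years_column_names = data.columns  # the 4 digit year strings
first_available_year = int(data.columns[0])
last_available_year = int(data.columns[-1])
```

The Database class may compute these properties differently; the sketch only shows how they relate to the shape of the dataframe.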

def store_data(self, data: pandas.core.frame.DataFrame, database_variable_names: pandas.core.indexes.base.Index = None):

Overview

Store the data in the database, dropping the columns that are not years.

Arguments

data: The data to store in the database.

database_variable_names: The names of the variables in the database. When provided, this is used to do validation of the database variables against the SYM model variables.

def export_to_csv(self, filename: str):

Export the database to a CSV file, making sure that the file extension is '.csv'.

def rebase(self, new_base_year: int):

Rebase a database so that indices have a new base year. This can be used to convert the database used for calibration to a database with the base year equal to the start year for projections (e.g. 2011 to 2018).

Note that this method draws on the approach in the G-Cubed utilities/rebasedata.ox script.

Arguments

new_base_year: The new base year for the database, as an integer in YYYY format.

Exceptions

An exception is raised if the database does not contain data for the new base year.

An exception is raised if the model has lagged index variables and the database does not contain data for the year after the new base year.
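The two documented failure conditions can be sketched as a pre-check on the year columns. The helper name check_rebase_preconditions and the has_lagged_index_variables flag are hypothetical and are not part of the Database API; the rebasing arithmetic itself is not shown because it is not described here.

```python
import pandas as pd


def check_rebase_preconditions(
    data: pd.DataFrame, new_base_year: int, has_lagged_index_variables: bool
) -> None:
    """Hypothetical helper mirroring the documented rebase exceptions."""
    # The new base year must be one of the year columns.
    if str(new_base_year) not in data.columns:
        raise Exception(f"No data for the new base year {new_base_year}.")
    # With lagged index variables, the following year is needed as well.
    following_year = new_base_year + 1
    if has_lagged_index_variables and str(following_year) not in data.columns:
        raise Exception(f"No data for {following_year}, the year after the new base year.")


# Made-up data with two year columns.
data = pd.DataFrame({"2017": [1.0], "2018": [1.1]}, index=["X(USA)"])
check_rebase_preconditions(data, 2017, has_lagged_index_variables=True)  # passes
```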

def rhs_vector_value(self, vector_name: str, year: int, use_neutral_real_interest_rate=False) -> numpy.ndarray:

Overview

Retrieves data from the database for all of the variables in a specific RHS vector in the model. The data is retrieved for the specified year.

Note that some state variables have their data retrieved for the following year.

Note also that interest rate values can be overridden by the globally defined neutral real interest rate that is set in the model configuration file.

The implementation steps are:

  1. get the rows for the variables of the given type in varmap.
  2. get the names of the variables in those rows from varmap.
  3. use those names to select the data from the calibration year database.
  4. set the values for those variables in the appropriate places in the vector to that data for that year using the indices specified in the varmap data.
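The four steps above can be sketched with a toy varmap and database. This is a simplified illustration, not the actual implementation: the varmap column names "name" and "index" are assumptions (only var_type is named in this documentation), the indices are assumed to be zero-based, and the handling of leads, lags and the neutral real interest rate override is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical varmap extract; "name" and "index" are assumed column names.
varmap = pd.DataFrame(
    {
        "name": ["A(USA)", "B(USA)", "C(USA)"],
        "var_type": ["x1r", "x1r", "j1r"],
        "index": [1, 0, 0],
    }
)
# Made-up database values.
data = pd.DataFrame(
    {"2011": [10.0, 20.0, 30.0], "2012": [11.0, 21.0, 31.0]},
    index=["A(USA)", "B(USA)", "C(USA)"],
)


def rhs_vector_sketch(vector_name: str, year: int):
    # Step 1: rows of varmap for the requested vector.
    rows = varmap[varmap["var_type"] == vector_name]
    if rows.empty:
        return None  # zero-length vector
    # Step 2: variable names in those rows.
    names = rows["name"]
    # Step 3: data for those variables in the requested year.
    values = data.loc[names, str(year)].to_numpy()
    # Step 4: place each value at its varmap index in a column vector.
    vector = np.zeros((len(rows), 1))
    vector[rows["index"].to_numpy(), 0] = values
    return vector


x1r = rhs_vector_sketch("x1r", 2011)
```

In this toy case A(USA) has index 1 and B(USA) has index 0, so the resulting column vector holds the value for B(USA) first and the value for A(USA) second.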

Arguments

vector_name: The name of the vector to get the values for. This must be a RHS vector listed in the model's RHS vector names by the SymData class.

year: The YYYY format year to get data for when populating the RHS vectors. For example, 2011 implies that the model equations are linearised around the values of the model variables in 2011 (or in adjacent years for leads/lags).

use_neutral_real_interest_rate: True if interest rates are to be overridden with the model configuration neutral real interest rate and False otherwise.

Returns

A column vector with the requested values for the RHS vector or None if the vector has zero length.

def get_data_and_varmap_indices(self, vector_name: str, year: int) -> tuple:

Arguments

vector_name: The three character name of the vector that is to be populated with data.

year: the 4 digit integer specifying the year in the database that will be used to source the data that will be inserted into the named vector.

This method uses the varmap file created by the SYM processor, finding those rows in the varmap that have a value in the var_type column that match the given vector_name, e.g. 'x1r'. The matching rows contain the variable names and their indices within the vector that has been named as an input to the function.

The variable names are used to determine the rows of the database where the data will be sourced.

The year determines the column in the database where the data will be sourced.

Returns

A tuple is returned. That tuple contains a numpy vector of the data that has been extracted from the database and a vector of indices indicating where, in the specified vector, that data should be inserted.

def get_data_and_varmap_indices_for_matching_variables(self, variable_prefix: str, vector_name: str, year: int):

Gets matching data for the given variable prefix for a given vector.

Arguments

variable_prefix: The prefix of the variable names to match.

vector_name: The name of the vector to be populated.

year: The year in the database from which the data is sourced.

Returns

A tuple containing the indices in the vector to be populated (as a list of integers) and the values to use to do the populating as a numpy column vector.

def get_data(self, name_regular_expression: str, years: list) -> pandas.core.frame.DataFrame:

Gets data for the set of variables with variable names that match the given regular expression.

Arguments

name_regular_expression: The variable selection criteria. It can be any regular expression that works with the Python regex package.

years: the list of years for which the data is to be retrieved. Note that this can be a list of integer values or a list of strings.

Returns

A copy of the data for the specified years for all variables with names matching the given regular expression, where the names are matched against the row index (labels) in the database.
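A minimal sketch of this kind of selection, using plain pandas on made-up data, is shown here. The function name get_data_sketch is hypothetical, and the sketch assumes search-style matching (the pattern may match anywhere in the name); whether the Database method anchors the pattern is not stated in this documentation.

```python
import pandas as pd

# Made-up variable names and values.
data = pd.DataFrame(
    {"2017": [1.0, 2.0, 3.0], "2018": [1.1, 2.1, 3.1]},
    index=["GDPR(USA)", "GDPR(JPN)", "INFL(USA)"],
)


def get_data_sketch(name_regular_expression: str, years: list) -> pd.DataFrame:
    # Accept years as integers or strings by normalising to column labels.
    year_labels = [str(year) for year in years]
    # Match the pattern against the row labels (variable names).
    rows = data.index.str.contains(name_regular_expression, regex=True)
    # Return a copy so callers cannot modify the stored data.
    return data.loc[rows, year_labels].copy()


usa = get_data_sketch(r"\(USA\)$", [2018])
```

Brackets in variable names need escaping in the pattern, as in the example, because unescaped brackets define a regular expression group.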

def get_data_for_variables_with_prefix(self, prefix: str, years: list = None) -> pandas.core.frame.DataFrame:

Gets a copy of the data for the set of variables with variable names that have the given variable name prefix (the part of the name up to but not including the part in brackets).

Arguments

  • prefix: The variable name prefix

  • years: the list of years for which the data is to be retrieved. Note that this can be a list of integer values or a list of strings. This argument defaults to None, in which case, all years of data are returned.

Returns

A copy of the data for the specified years for all variables with names that have the given prefix.

Exceptions

  • An exception is raised if the prefix is None, is not a string or has a length of zero.

  • An exception is raised if no variables have the given prefix.

  • An exception is raised if the years are not valid database years.

  • An exception is raised if the number of variables in the database does not match the number of variables in the SYM model.
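The prefix rule (the part of the name before the bracket) and the first three documented exceptions can be sketched as follows. The function name with_prefix_sketch and the data are hypothetical, and the check against the SYM model variable count is omitted because it needs the model itself.

```python
import pandas as pd

# Made-up variable names; GDPRX shares leading characters with GDPR
# but has a different prefix.
data = pd.DataFrame(
    {"2017": [1.0, 2.0, 3.0], "2018": [1.1, 2.1, 3.1]},
    index=["GDPR(USA)", "GDPR(JPN)", "GDPRX(USA)"],
)


def with_prefix_sketch(prefix: str, years: list = None) -> pd.DataFrame:
    if not isinstance(prefix, str) or len(prefix) == 0:
        raise Exception("The prefix must be a non-empty string.")
    # The prefix is the part of the name before the opening bracket.
    prefixes = data.index.str.split("(").str[0]
    rows = prefixes == prefix
    if not rows.any():
        raise Exception(f"No variables have the prefix {prefix}.")
    if years is None:
        return data.loc[rows].copy()  # all years
    year_labels = [str(year) for year in years]
    missing = [label for label in year_labels if label not in data.columns]
    if missing:
        raise Exception(f"Years not in the database: {missing}")
    return data.loc[rows, year_labels].copy()


gdpr = with_prefix_sketch("GDPR", [2017])
```

Comparing the whole prefix, rather than testing whether the name starts with it, keeps GDPRX(USA) out of the result for the prefix GDPR.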

def update_data(self, new_data: pandas.core.frame.DataFrame):

Replace the existing data property with a new dataframe. All of the data is replaced.

This is useful if you need to do projections and then treat those projections as actual data in a subsequent step in your analysis pipeline.

Arguments

new_data (pd.DataFrame): The new dataframe to use.

def has_data(self, year: int) -> bool:

Used to check if there is a column of data in the database for the specified year.

Arguments

year: The 4 digit integer value of the year (YYYY).

Returns

True if the database has data for the specified year and False otherwise.

has_data_for_all_projection_years: bool

This property is used when determining whether the database has been populated with projections, in which case those projections provide values, now stored as data, out to the end year of the projections.

Returns

True if the database has data for the last projection year and False otherwise.

def set_up_gdp_ratio_scaling_factor(self):

Overview

Sets up the scaling factor for variables that have units that end in a gdp suffix but not a usgdp suffix.

These scaling factors can then be used to do variable scaling as we convert between database values and values used to evaluate model equations.

def gdp_ratio_scaling_factor(self, year: int) -> pandas.core.series.Series:

Overview

Retrieves a series of gdp ratio scaling factors for all variables in the database. The values are:

  • 1 for variables that do not need to be scaled before use in the model
  • the ratio of local nominal GDP to USA nominal GDP (both measured in billions of USD) in the chosen year otherwise.

Arguments

  • year: The database year for which the scaling factors are being retrieved.

Returns

The series of values to use for scaling so that variables are a fraction of US GDP rather than local GDP, where that is appropriate. This series is suitable for broadcasting across the database.

Exceptions

  • An exception is raised if the GDP ratio scaling factors have not been set up in the database.

  • An exception is raised if the year is not in the database when retrieving the GDP ratio scaling factor.

  • An exception is raised if the year is not an integer in YYYY format.
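Broadcasting a series of scaling factors across the database can be sketched with plain pandas. The variable names, values and the 0.25 GDP ratio below are invented for illustration; the sketch only shows how a series of factors, indexed by variable name, is applied row by row.

```python
import pandas as pd

# Hypothetical values: the first variable is assumed to be a share of
# local GDP, the second is assumed to need no scaling.
data = pd.DataFrame(
    {"2018": [4.0, 100.0]},
    index=["DEFI(JPN)", "PRID(JPN)"],
)

# 1 for variables that need no scaling; otherwise the (made-up) ratio of
# local nominal GDP to USA nominal GDP in the chosen year.
scaling = pd.Series([0.25, 1.0], index=data.index)

# Multiply each row by its own factor (alignment is on the row index).
scaled = data.mul(scaling, axis=0)
```

Here the first row becomes 1.0 (4.0 times 0.25) while the second row is unchanged.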

def gdp_ratio_scaling_factor_for_variable(self, variable_name: str, year: any) -> pandas.core.frame.DataFrame:

Overview

Get the GDP ratio scaling factor for the specified variable in the specified year.

Arguments

variable_name: The full name of the variable.

year: The year for which the GDP ratio scaling factor is being retrieved. It can be a string or an integer. It is converted to a string when used.

Returns

A floating value that is the scaling factor that, when multiplied by the database value, converts it from a percentage of local GDP to a percentage of US GDP.

def value(self, variable_name: str, year: any) -> float:

Overview

Retrieves the value of a variable in the database for a specific year.

Arguments

  • variable_name: The full name of the variable.

  • year: The year for which the value is to be retrieved. It can be a string or an integer. It is converted to a string when used.

Returns

The value of the variable for the specified year if the year is between the first and the last available years in the database, inclusive.

def set_value(self, variable_name: str, year: any, value: float):

Overview

Set the value of a variable in the database for a specific year.

Arguments

  • variable_name: The full name of the variable.

  • year: The year for which the value is to be set. It can be a string or an integer.

  • value: The value to set for the variable in the specified year.
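Reading and writing a single cell, with the year accepted as a string or an integer, can be sketched as follows. The function names value_sketch and set_value_sketch are hypothetical stand-ins for value and set_value; converting the year to a string in both is an assumption based on the year-string column labels.

```python
import pandas as pd

# Made-up data with year-string column labels.
data = pd.DataFrame({"2017": [1.0], "2018": [1.1]}, index=["GDPR(USA)"])


def value_sketch(variable_name: str, year) -> float:
    # Convert the year to a string so 2018 and "2018" address the same column.
    return float(data.at[variable_name, str(year)])


def set_value_sketch(variable_name: str, year, value: float) -> None:
    # Overwrite one cell, identified by variable name and year label.
    data.at[variable_name, str(year)] = value


set_value_sketch("GDPR(USA)", 2018, 1.25)
```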

def save(self, file_path: pathlib.Path):

Overview

Save the database to a CSV file.

Arguments

  • file_path: The path to the file where the database is to be saved.

Exceptions

  • AssertionError: If the file path is not an absolute path.
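The absolute-path requirement can be sketched with pathlib and pandas. The function name save_sketch and the data are hypothetical; the sketch assumes the variable names are written as the first CSV column, which is the pandas default but is not confirmed for the Database class.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data with year-string column labels.
data = pd.DataFrame({"2017": [1.0], "2018": [1.1]}, index=["GDPR(USA)"])


def save_sketch(file_path: Path) -> None:
    # Relative paths are rejected, matching the documented AssertionError.
    assert file_path.is_absolute(), "The file path must be absolute."
    data.to_csv(file_path)


with tempfile.TemporaryDirectory() as directory:
    target = Path(directory).resolve() / "database.csv"
    save_sketch(target)
    # Read the file back, using the first column as the row labels.
    reloaded = pd.read_csv(target, index_col=0)
```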