h2o-python


h2o Module

h2o.init(url=None, ip=None, port=None, https=None, insecure=None, username=None, password=None, cookies=None, proxy=None, start_h2o=True, nthreads=-1, ice_root=None, enable_assertions=True, max_mem_size=None, min_mem_size=None, strict_version_check=None, ignore_config=False, **kwargs)

Attempt to connect to a local server, or if not successful start a new server and connect to it.

Parameters:

  • url – Full URL of the server to connect to (can be used instead of ip + port + https).
  • ip – The ip address (or host name) of the server where H2O is running.
  • port – Port number that H2O service is listening to.
  • https – Set to True to connect via https:// instead of http://.
  • insecure – When using https, setting this to True will disable SSL certificates verification.
  • username – Username for basic authentication.
  • password – Password for basic authentication.
  • cookies – Cookie (or list of) to add to each request.
  • proxy – Proxy server address.
  • start_h2o – If False, do not attempt to start an h2o server when the connection to an existing one fails.
  • nthreads – “Number of threads” option when launching a new h2o server.
  • ice_root – Directory for temporary files for the new h2o server.
  • enable_assertions – Enable assertions in Java for the new h2o server.
  • max_mem_size – Maximum memory to use for the new h2o server.
  • min_mem_size – Minimum memory to use for the new h2o server.
  • strict_version_check – If True, an error will be raised if the client and server versions don’t match.
  • ignore_config – If True, the .h2oconfig file is not processed. Default value is False.
  • kwargs – (all other deprecated attributes)
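
A minimal sketch of a typical call, assuming the h2o package and a Java runtime are installed. The helper name connect_or_start and the address values are illustrative assumptions, not part of the library; only parameters listed above are passed. The import sits inside the function so that the definition loads even where h2o is absent.

```python
def connect_or_start(ip="localhost", port=54321):
    """Connect to an H2O server at ip:port; start a local one if none answers.

    Hypothetical helper. The ip and port defaults are assumed example values.
    """
    import h2o  # lazy import: needs the h2o package and a Java runtime

    h2o.init(
        ip=ip,
        port=port,
        nthreads=-1,     # same as the default in the signature above
        start_h2o=True,  # allow launching a new server if connecting fails
    )
    return h2o
```

Passing start_h2o=False instead would make the call fail rather than launch a server, which can be useful when the client is only ever supposed to attach to an existing cluster.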

h2o.upload_file(path, destination_frame=None, header=0, sep=None, col_names=None, col_types=None, na_strings=None)

Upload a dataset from the provided local path to the H2O cluster. This does a single-threaded push to H2O. See also import_file().

Parameters:

  • path – A path specifying the location of the data to upload.
  • destination_frame – The unique hex key assigned to the imported file. If none is given, a key will be automatically generated.
  • header – -1 means the first line is data, 0 means guess, 1 means the first line is a header.
  • sep – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
  • col_names – A list of column names for the file.
  • col_types – A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
    • “unknown” - this will force the column to be parsed as all NA
    • “uuid” - the values in the column must be true UUID or will be parsed as NA
    • “string” - force the column to be parsed as a string
    • “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
    • “enum” - force the column to be parsed as a categorical column.
    • “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats: (date) “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”, (time) “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”. Times can also contain “AM” or “PM”.
  • na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.

h2o.import_file(path=None, destination_frame=None, parse=True, header=0, sep=None, col_names=None, col_types=None, na_strings=None, pattern=None)

Import a dataset that is already on the cluster.
The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster cannot see the file, then an exception will be thrown by the H2O cluster. Does a parallel/distributed multi-threaded pull of the data. The main difference between this method and upload_file() is that the latter works with local files, whereas this method imports remote files (i.e. files local to the server). If you are running the H2O server on your own machine, then both methods behave the same.

Parameters: same as for upload_file() above, plus:

  • pattern – Character string containing a regular expression to match file(s) in the folder if path is a directory.

Examples:

import h2o

# Single file import
iris = h2o.import_file("h2o-3/smalldata/iris.csv")
# Return all files in the folder iris/ matching the regex r"iris_.*\.csv"
iris_pattern = h2o.import_file(path="h2o-3/smalldata/iris", pattern=r"iris_.*\.csv")

Data In H2O

Loading Data From A CSV File

Load data using either h2o.import_file or h2o.upload_file.

h2o.import_file uses cluster-relative names and ingests data in parallel.

h2o.upload_file uses Python client-relative names and single-threaded file upload from the client.
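
The choice between the two loaders can be sketched as a single switch. The helper name load_csv and its from_client flag are hypothetical, and the na_strings tokens are assumed example values; the h2o calls use only the parameters documented above. The import is deferred so the definition loads without an H2O installation.

```python
def load_csv(path, from_client=True):
    """Load a CSV file into H2O and return the resulting frame.

    from_client=True  -> h2o.upload_file: path is relative to this Python client.
    from_client=False -> h2o.import_file: path must be visible to every cluster node.
    """
    import h2o  # lazy import: needs the h2o package and a reachable cluster

    loader = h2o.upload_file if from_client else h2o.import_file
    return loader(
        path,
        header=1,                # 1 means the first line is a header
        na_strings=["NA", "?"],  # assumed tokens to treat as missing values
    )
```

For large files that every node can already read, the import_file branch is usually the better fit because the pull is parallel.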

Loading Data From A Python Object

To transfer the data that are stored in python data structures to H2O, use the H2OFrame constructor and the python_obj argument. Additionally, from_python performs the same function but provides a few more options for how H2O will parse the data.

The following types are permissible for python_obj:

  • tuple ()
  • list []
  • dict {}
  • collections.OrderedDict
  • numpy.ndarray
  • pandas.DataFrame
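
As an illustration, column-oriented data can be held in an OrderedDict and handed to the constructor. This is a sketch under the assumption that each key becomes a column name and each value that column's cells; the column names and the helper to_h2o are made up for the example, and the conversion itself needs a connected cluster.

```python
from collections import OrderedDict

# Column-oriented data: each key is a column name, each value that column's cells.
columns = OrderedDict([
    ("id", [1, 2, 3]),
    ("score", [0.5, 0.75, 1.0]),
    ("label", ["a", "b", "a"]),
])


def to_h2o(python_obj):
    """Send a Python object to the cluster via the H2OFrame constructor."""
    import h2o  # lazy import: needs the h2o package and a prior h2o.init()

    return h2o.H2OFrame(python_obj=python_obj)
```

An OrderedDict is used rather than a plain dict so that the column order is explicit on every Python version.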

Loading A Python Tuple

Essentially, the tuple is an immutable list. This immutability does not map to the H2OFrame. So Pythonistas beware!

The restrictions on what goes inside the tuple are fairly relaxed, but if they are not recognized, a ValueError is raised.

A tuple is formatted as follows:

(i1, i2, i3, …, iN)
Restrictions are mainly on the types of the individual iJ (1 <= J <= N). Here N is the number of rows in the column represented by this tuple.

If iJ is {} for some J, then a ValueError is raised. If iJ is a () (tuple) or [] (list), then iJ must be a () or [] for all J; otherwise a ValueError is raised. In other words, any mixing of types will result in a ValueError.

Additionally, only a single layer of nesting is allowed: if iJ is a () or [], and if it contains any () or [], then a ValueError is raised.

If iJ is not a () or [], then it must be of type string or a non-complex numeric type (float or int). In other words, if iJ is not a tuple, list, string, float, or int, for some J, then a ValueError is raised.
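
The rules above can be restated as a small pure-Python checker. This is only an illustration of the described restrictions, not H2O's own validation code; the name check_tuple is hypothetical.

```python
SCALARS = (str, int, float)  # complex numbers and dicts are deliberately excluded


def check_tuple(obj):
    """Return True if obj follows the tuple rules above, else raise ValueError."""
    nested = [isinstance(item, (tuple, list)) for item in obj]
    # Either every item is a nested sequence or none is: no mixing allowed.
    if any(nested) and not all(nested):
        raise ValueError("items must be all sequences or all scalars")
    for item in obj:
        inner = item if isinstance(item, (tuple, list)) else (item,)
        for value in inner:
            # Inner values must be scalars, so a second nesting layer is rejected.
            if not isinstance(value, SCALARS):
                raise ValueError("unsupported item: %r" % (value,))
    return True
```

For example, check_tuple((1, 2.5, "x")) and check_tuple(([1, 2], [3, 4])) pass, while (1, [2, 3]), ([1, [2]],) and ({},) each raise ValueError.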

Data Manipulation

H2OFrame

class h2o.frame.H2OFrame(python_obj=None, destination_frame=None, header=0, separator=u', ', column_names=None, column_types=None, na_strings=None)

Primary data store for H2O.

H2OFrame is similar to pandas’ DataFrame. The critical distinction
is that the data is generally not held in memory; instead it is
located on a (possibly remote) H2O cluster, and thus an H2OFrame
is merely a handle to that data.


asfactor

asfactor()
Convert columns in the current frame to categoricals.
Returns: new H2OFrame with columns of the “enum” type.


describe

describe(chunk_summary=False)
Generate an in-depth description of this H2OFrame.

This will print to the console the dimensions of the frame;
names/types/summary statistics for each column; and finally the first ten
rows of the frame.

Parameters: chunk_summary (bool) – Retrieve the chunk summary along
with the distribution summary


impute

impute(column=-1, method=u'mean', combine_method=u'interpolate', by=None, group_by_frame=None, values=None)
Impute missing values into the frame, modifying it in-place.

Parameters:

  • column (int) – Index of the column to impute, or -1 to impute the entire frame.
  • method (str) – The method of imputation: “mean”, “median”, or “mode”.
  • combine_method (str) – When the method is “median”, this setting dictates how to combine quantiles for even samples. One of “interpolate”, “average”, “low”, “high”.
  • by – The list of columns to group on.
  • group_by_frame (H2OFrame) – Impute the values with this pre-computed grouped frame.
  • values (List) – The list of impute values, one per column. None indicates to skip the column.

Returns: A list of values used in the imputation or the group-by
result used in imputation.
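
What the three methods compute can be illustrated on a plain Python list. This is a conceptual sketch only, not H2O's implementation: the helper impute_column is hypothetical, it returns a new list instead of modifying a frame in place, and for an even number of values its median simply averages the two middle values.

```python
import math
import statistics


def impute_column(values, method="mean"):
    """Return (filled, fill): values with NaNs replaced, and the value used."""
    present = [v for v in values if not math.isnan(v)]
    if method == "mean":
        fill = statistics.mean(present)
    elif method == "median":
        fill = statistics.median(present)  # averages the two middle values if even
    elif method == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError("method must be 'mean', 'median' or 'mode'")
    return [fill if math.isnan(v) else v for v in values], fill
```

For instance, impute_column([1.0, float("nan"), 3.0]) fills the gap with 2.0 and also returns 2.0 as the value used, mirroring the way impute reports the values it applied.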


mean

mean(skipna=True, axis=0, **kwargs)
Compute the frame’s means by-column (or by-row).

Parameters:

  • skipna (bool) – If True (default), then NAs are ignored during the computation. Otherwise the presence of NAs renders the entire result NA.
  • axis (int) – Direction of mean computation. If 0 (default), then the mean is computed columnwise, and the result is a frame with 1 row and the same number of columns as the original frame. If 1, then the mean is computed rowwise and the result is a frame with 1 column (called “mean”) and as many rows as the original frame.

Returns: either a list of mean values per-column (old semantic); or an H2OFrame containing mean values per-column/per-row from the original frame (new semantic). The new semantic is triggered by either providing the return_frame=True parameter, or having the general.allow_breaking_changes config option turned on.
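
The effect of axis and skipna can be illustrated on a list of rows in plain Python. This is a sketch of the described semantics only; frame_mean is a hypothetical helper and returns a flat list rather than an H2OFrame.

```python
import math


def frame_mean(rows, axis=0, skipna=True):
    """Mean of a list-of-rows table, per column (axis=0) or per row (axis=1)."""
    groups = list(zip(*rows)) if axis == 0 else rows
    result = []
    for group in groups:
        kept = [v for v in group if not (skipna and math.isnan(v))]
        # With skipna=False a NaN stays in kept, so the sum (and the mean) is NaN.
        result.append(sum(kept) / len(kept) if kept else math.nan)
    return result
```

With rows [[1.0, 2.0], [3.0, nan]], axis=0 gives one value per column ([2.0, 2.0]) and axis=1 gives one value per row ([1.5, 3.0]); with skipna=False the second column's mean becomes NaN.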


runif

runif(seed=None)
Generate a column of random numbers drawn from a uniform distribution [0,1) and having the same data layout as
the source frame.

Parameters: seed (int) – seed for the random number generator.
Returns: Single-column H2OFrame filled with doubles sampled uniformly from [0,1).


split_frame

split_frame(ratios=None, destination_frames=None, seed=None)

Split a frame into distinct subsets of size determined by the given ratios.
The number of subsets is always 1 more than the number of ratios given. Note
that this does not give an exact split. H2O is designed to be efficient on big
data using a probabilistic splitting method rather than an exact split. For
example when specifying a split of 0.75/0.25, H2O will produce a test/train
split with an expected value of 0.75/0.25 rather than exactly 0.75/0.25. On
small datasets, the sizes of the resulting splits will deviate from the expected
value more than on big data, where they will be very close to exact.

Parameters:

  • ratios (List[float]) – The fractions of rows for each split.
  • destination_frames (List[str]) – The names of the split frames.
  • seed (int) – Seed for the random number generator.
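
Why the split sizes are only approximate can be seen in a pure-Python sketch of probabilistic splitting: each row draws a uniform number in [0, 1) and lands in the subset whose cumulative ratio range contains it. This illustrates the idea and is not H2O's code; the name split_rows is hypothetical.

```python
import itertools
import random


def split_rows(rows, ratios, seed=None):
    """Probabilistically split rows into len(ratios) + 1 subsets."""
    if not ratios or sum(ratios) >= 1.0:
        raise ValueError("ratios must be non-empty and sum to less than 1")
    rng = random.Random(seed)
    bounds = list(itertools.accumulate(ratios))  # e.g. [0.6, 0.2] -> [0.6, 0.8]
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        u = rng.random()                     # uniform draw in [0, 1)
        index = sum(u >= b for b in bounds)  # number of boundaries at or below u
        splits[index].append(row)
    return splits
```

Every row ends up in exactly one subset, but the subset sizes vary from run to run unless a seed is fixed, and the relative deviation from the requested ratios shrinks as the number of rows grows.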

GroupBy

class h2o.group_by.GroupBy(fr, by)
A class that represents the group by operation on an H2OFrame.
The returned groups are sorted by the natural group-by column sort.

Parameters:

  • fr (H2OFrame) – H2OFrame that you want the group by operation to be performed on.
  • by – A column name (str) or an index (int) of a single column, or a list for multiple columns denoting the set of columns to group by.

Example:

my_frame = ...  # some existing H2OFrame
grouped = my_frame.group_by(by=["C1", "C2"])
grouped.sum(col="X1", na="all").mean(col="X5", na="all").max()
grouped.get_frame()

Any number of aggregations may be chained together in this manner.
Note that once the aggregation operations are complete, calling the
GroupBy object with a new set of aggregations will yield no effect.
You must generate a new GroupBy object in order to apply a new
aggregation on it. In addition, certain aggregations are only defined
for numerical or categorical columns. An error will be thrown for
calling aggregation on the wrong data types.

If no arguments are given to the aggregation (e.g. “max” in the above
example), then it is assumed that the aggregation should apply to all
columns but the group by columns.

All GroupBy aggregations take parameter na, which controls treatment
of NA values during the calculation. It can be one of:

  • “all” (default) – any NAs are used in the calculation as-is, which usually results in the final result being NA too.
  • “ignore” – NA entries are not included in calculations, but the total number of entries is taken as the total number of rows. For example, mean([1, 2, 3, nan], na="ignore") will produce 1.5.
  • “rm” – entries are skipped during the calculations, reducing the total effective count of entries. For example, mean([1, 2, 3, nan], na="rm") will produce 2.
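
The three modes can be reproduced for the mean in a few lines of plain Python. This is a sketch of the arithmetic described above, not the GroupBy implementation; group_mean is a hypothetical helper.

```python
import math


def group_mean(values, na="all"):
    """Mean of values under the three NA-handling modes described above."""
    present = [v for v in values if not math.isnan(v)]
    if na == "all":
        return sum(values) / len(values)    # a NaN in the sum makes the result NaN
    if na == "ignore":
        return sum(present) / len(values)   # NAs left out of the sum, kept in the count
    if na == "rm":
        return sum(present) / len(present)  # NAs left out of both sum and count
    raise ValueError("na must be 'all', 'ignore' or 'rm'")
```

For [1, 2, 3, nan] this gives NaN for "all", 6 / 4 = 1.5 for "ignore", and 6 / 3 = 2 for "rm", matching the examples in the list.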

Variance (var) and standard deviation (sd) are the sample (not population) statistics.


Modeling In H2O

Supervised
